By NIKETH ANNAM
In this tutorial, we will learn how to build a predictive model using XGBoost in Python. This process involves data preparation, normalization, and predicting the employee attrition rate.
Attrition refers to the gradual loss of employees over time. Many companies hire employees every year and invest time and money in training them, so losing those employees is costly. Predicting employee attrition can therefore save a company considerable expense. Let us see how to predict it. The dataset used in this tutorial is taken from a HackerEarth challenge and can be downloaded here.
Firstly we will import the required libraries
import numpy as np
import pandas as pd
import os
import zipfile
Creating the path and extracting the zip file
path = '/content/drive/My Drive/datasets/3f488f10aa3d11ea.zip'
with zipfile.ZipFile(path) as zip_ref:
    zip_ref.extractall()
Because the data comes from a competition, there are three CSV files: train, test, and submission. We will be using only the train.csv file in this tutorial.
train_df = pd.read_csv('/content/Dataset/Train.csv')
train_df.shape
train_df.columns
Index(['Employee_ID', 'Gender', 'Age', 'Education_Level', 'Relationship_Status', 'Hometown', 'Unit', 'Decision_skill_possess',
'Time_of_service', 'Time_since_promotion', 'growth_rate', 'Travel_Rate', 'Post_Level', 'Pay_Scale', 'Compensation_and_Benefits',
'Work_Life_balance', 'VAR1', 'VAR2', 'VAR3', 'VAR4', 'VAR5', 'VAR6', 'VAR7', 'Attrition_rate'], dtype='object')
Checking if there are any null values in the data
We found that the columns Age, Time_of_service, Work_Life_balance, VAR2, and VAR4 have some null values.
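A null check of this kind can be sketched as follows. This is a minimal illustration, assuming a small made-up frame (`toy_df`) in place of the real `train_df`; the values are hypothetical and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df (values are made up)
toy_df = pd.DataFrame({
    'Age': [25.0, np.nan, 41.0, 33.0],
    'Time_of_service': [2.0, 10.0, np.nan, 8.0],
    'Pay_Scale': [4.0, 6.0, 7.0, 5.0],
})

# Count missing values per column
null_counts = toy_df.isnull().sum()

# Keep only the columns that actually contain missing values
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```

Running the same two lines on `train_df` should list the columns that need to be filled.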
We will fill these NaN values as shown below:
# Build a lookup of the mean Age for each Time_of_service value
tos_age = pd.DataFrame(train_df.groupby(['Time_of_service'])['Age'].mean())
for i in train_df.index:
    if pd.isna(train_df.at[i, 'Age']):
        # Take the Time_of_service value on the same row as the missing Age
        value = train_df.at[i, 'Time_of_service']
        # If Time_of_service is also missing, leave Age as NaN for the mean fill later
        if pd.notna(value):
            train_df.at[i, 'Age'] = tos_age.at[value, 'Age']
# Build a lookup of the mean Time_of_service for each Age value
age_tos = pd.DataFrame(train_df.groupby(['Age'])['Time_of_service'].mean())
for i in train_df.index:
    if pd.isna(train_df.at[i, 'Time_of_service']):
        value = train_df.at[i, 'Age']
        # If Age is also missing, leave the value as NaN for the mean fill later
        if pd.notna(value):
            train_df.at[i, 'Time_of_service'] = age_tos.at[value, 'Time_of_service']
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())
train_df['Time_of_service'] = train_df['Time_of_service'].fillna(train_df['Time_of_service'].mean())
train_df['Work_Life_balance'] = train_df['Work_Life_balance'].fillna(2.0)
train_df['VAR2'] = train_df['VAR2'].fillna(train_df['VAR2'].mean())
train_df['VAR4'] = train_df['VAR4'].fillna(2)
train_df['Pay_Scale'] = train_df['Pay_Scale'].fillna(6.0)
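The row-by-row fill of Age from Time_of_service can also be written without an explicit loop. The sketch below is one possible vectorized version, not the method used in this tutorial; `toy_df` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up, not taken from Train.csv)
toy_df = pd.DataFrame({
    'Age': [30.0, np.nan, 40.0, 50.0, np.nan],
    'Time_of_service': [5.0, 5.0, 15.0, 15.0, np.nan],
})

# Mean Age for each Time_of_service value (the same lookup idea as tos_age)
mean_age_by_tos = toy_df.groupby('Time_of_service')['Age'].mean()

# Look up the group mean for every row; rows with a missing key get NaN
age_from_group = toy_df['Time_of_service'].map(mean_age_by_tos)

# Fill missing Age from the group mean where one is available
toy_df['Age'] = toy_df['Age'].fillna(age_from_group)

# Remaining gaps (both columns missing) fall back to the overall mean
toy_df['Age'] = toy_df['Age'].fillna(toy_df['Age'].mean())
print(toy_df)
```

The same pattern, with the column names swapped, would cover the Time_of_service fill.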
Since employee ID has no relation to the attrition, we will drop that column
train_df = train_df.drop('Employee_ID', axis=1)
Now we will create dummy (one-hot) columns for every column of object type and then remove the original object columns
for col in train_df.columns:
    if train_df[col].dtype == 'O':
        dummies = pd.get_dummies(train_df[col])
        train_df[dummies.columns] = dummies
for col in train_df.columns:
    if train_df[col].dtype == 'O':
        train_df = train_df.drop(col, axis=1)
train_df.shape
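As an alternative sketch, `pd.get_dummies` can encode a whole frame in one call. By default it prefixes each new column with the name of its source column, which avoids clashes if two columns happen to share a category label. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame (values are made up)
toy_df = pd.DataFrame({
    'Gender': ['F', 'M', 'F'],
    'Unit': ['IT', 'Sales', 'IT'],
    'Age': [25, 41, 33],
})

# Encode every categorical column; numeric columns pass through unchanged.
# dtype=int yields 0/1 columns instead of booleans.
encoded_df = pd.get_dummies(toy_df, dtype=int)
print(encoded_df.columns.tolist())
```

The original categorical columns are dropped automatically, so no second loop is needed.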
Separating the features from the target for training
df = train_df.drop(['Attrition_rate'], axis=1)
target = train_df['Attrition_rate']
Normalizing the data using MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
scaled_feat = pd.DataFrame(data=scaled, columns=df.columns)
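With its default range of 0 to 1, MinMaxScaler rescales each column as (x - min) / (max - min). The short sketch below reproduces that arithmetic by hand with NumPy on a made-up array, just to show what the scaled values look like.

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales
features = np.array([[20.0, 1.0],
                     [30.0, 3.0],
                     [60.0, 5.0]])

# Per-column minimum and maximum
col_min = features.min(axis=0)
col_max = features.max(axis=0)

# Min-max scaling: every column now lies between 0 and 1
scaled = (features - col_min) / (col_max - col_min)
print(scaled)
```

Each column's smallest value maps to 0 and its largest to 1, so features measured in different units become comparable.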
Importing required libraries
import xgboost as xgb
from xgboost.sklearn import XGBRegressor
from sklearn import metrics
Defining a function that chooses the number of trees by cross-validation, fits the model, and reports the RMSE on the training data
def modelfit(alg, train_df, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        # Cross-validate with early stopping to pick the number of boosting rounds
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(train_df[predictors].values, label=target.values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds, metrics='rmse', early_stopping_rounds=early_stopping_rounds)
        # cvresult has one row per boosting round that was kept
        alg.set_params(n_estimators=cvresult.shape[0])
    # Fit the algorithm on the data
    # (eval_metric is not passed to fit(): it only matters with an eval_set,
    # and recent XGBoost versions no longer accept it there)
    alg.fit(train_df[predictors], target)
    # Predict the train set
    train_df_predictions = alg.predict(train_df[predictors])
    print("\nModel Report")
    print("RMSE :", (metrics.mean_squared_error(target.values, train_df_predictions))**0.5)
predictors = [x for x in df.columns]
xgb1 = XGBRegressor(booster='dart',
                    learning_rate=0.1,
                    n_estimators=35,
                    max_depth=2,
                    min_child_weight=1,
                    gamma=1,
                    subsample=0.8,
                    colsample_bytree=0.8,
                    objective='reg:squarederror',
                    nthread=4,
                    scale_pos_weight=1,
                    seed=40)
modelfit(xgb1, scaled_feat, predictors)
RMSE : 0.18588951725259212
The root mean squared error obtained for the model is 0.1859. Note that this value is computed on the same data the model was trained on, so it is an optimistic estimate; scoring on held-out data would give a more reliable picture of accuracy. Further feature engineering can be done to improve the model.
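One way to put an RMSE figure in context is to compare it with a trivial baseline that always predicts the mean of the training target, scored on rows kept aside. The sketch below uses synthetic target values (not the real Attrition_rate column) and a hypothetical helper named `rmse`; it only illustrates the comparison.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-in for the target column (made-up values, not the real data)
rng = np.random.default_rng(40)
y = rng.uniform(0.0, 1.0, size=1000)

# Hold out the last 20% of rows for evaluation
split = int(0.8 * len(y))
y_train, y_valid = y[:split], y[split:]

# Baseline: always predict the training-set mean
baseline_pred = np.full_like(y_valid, y_train.mean())
baseline_rmse = rmse(y_valid, baseline_pred)
print("Baseline RMSE on held-out rows:", baseline_rmse)
```

A model is only adding value if its held-out RMSE is clearly below this baseline figure.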