Predictive modelling using XGBoost in Python

By NIKETH ANNAM

In this tutorial, we will learn how to build a predictive model using XGBoost in Python. The process involves data preparation, normalization, and prediction of the employee attrition rate.

Attrition refers to the gradual loss of employees over time. Companies hire many employees every year and invest time and money in training them, so losing trained employees is a significant cost. Predicting which employees are likely to leave can therefore save a company a great deal. Let us see how to predict employee attrition. The dataset used in this tutorial is taken from a HackerEarth challenge and can be downloaded here.

Data Preparation (Load, Clean) and Analysis

First, we will import the required libraries:

import numpy as np
import pandas as pd
import os
import zipfile

Next, we set the path to the dataset and extract the zip archive:

path = '/content/drive/My Drive/datasets/3f488f10aa3d11ea.zip'
with zipfile.ZipFile(path) as zip_ref:
  zip_ref.extractall()
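Note: the path above suggests the notebook is running in Google Colab with Google Drive mounted. If so, Drive must be mounted before the archive can be read (a minimal sketch):

# Assumption: running in Google Colab; mounting Google Drive exposes
# '/content/drive/My Drive/...' so the zip file can be read
from google.colab import drive
drive.mount('/content/drive')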

Since the data comes from a competition, the archive contains three CSV files: train, test, and submission. We will use only the Train.csv file in this tutorial.

train_df = pd.read_csv('/content/Dataset/Train.csv')
train_df.shape

Output:

(7000, 24)

train_df.columns

Output:

Index(['Employee_ID', 'Gender', 'Age', 'Education_Level', 'Relationship_Status', 'Hometown', 'Unit', 'Decision_skill_possess',
'Time_of_service', 'Time_since_promotion', 'growth_rate', 'Travel_Rate', 'Post_Level', 'Pay_Scale', 'Compensation_and_Benefits',
'Work_Life_balance', 'VAR1', 'VAR2', 'VAR3', 'VAR4', 'VAR5', 'VAR6', 'VAR7', 'Attrition_rate'], dtype='object')

Checking if there are any null values in the data

train_df.isnull().sum()

We find that the columns Age, Time_of_service, Work_Life_balance, Pay_Scale, VAR2, and VAR4 contain null values.
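If you only want to see the columns that actually contain nulls, you can filter the result (a small optional sketch):

# Show only the columns with at least one missing value
null_counts = train_df.isnull().sum()
print(null_counts[null_counts > 0])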

We will replace these NaN values with suitable substitutes, as shown below. Where Age is missing, we impute it from the mean Age of employees with the same Time_of_service, and vice versa; anything still missing falls back to a simple fill.

# Create a lookup of the mean Age for each Time_of_service value
tos_age = pd.DataFrame(train_df.groupby(['Time_of_service'])['Age'].mean())
for i in range(len(train_df)):
  if pd.isna(train_df.at[i, 'Age']):
    value = train_df.at[i, 'Time_of_service'] # Time_of_service of this row
    # Fill Age from the lookup only if Time_of_service is known;
    # otherwise leave it as NaN for the mean imputation below
    if not pd.isna(value):
      train_df.at[i, 'Age'] = tos_age.at[value, 'Age']
# Create a lookup of the mean Time_of_service for each Age value
age_tos = pd.DataFrame(train_df.groupby(['Age'])['Time_of_service'].mean())
for i in range(len(train_df)):
  if pd.isna(train_df.at[i, 'Time_of_service']):
    value = train_df.at[i, 'Age']
    if not pd.isna(value):
      train_df.at[i, 'Time_of_service'] = age_tos.at[value, 'Time_of_service']
# Rows still missing both Age and Time_of_service fall back to the column mean
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())
train_df['Time_of_service'] = train_df['Time_of_service'].fillna(train_df['Time_of_service'].mean())
# Fill the remaining columns with fixed values or the column mean
train_df['Work_Life_balance'] = train_df['Work_Life_balance'].fillna(2.0)
train_df['VAR2'] = train_df['VAR2'].fillna(train_df['VAR2'].mean())
train_df['VAR4'] = train_df['VAR4'].fillna(2)
train_df['Pay_Scale'] = train_df['Pay_Scale'].fillna(6.0)
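As a sanity check, you can confirm that no missing values remain (an optional sketch):

# The total count of NaNs across all columns should now be zero
assert train_df.isnull().sum().sum() == 0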

Since Employee_ID has no bearing on attrition, we will drop that column:

train_df = train_df.drop('Employee_ID', axis=1)
train_df.shape

Output:

(7000, 23)

Now we will create dummy variables for every object-type (categorical) column and then drop the original columns:

# Add one 0/1 indicator column per category of every object-type column
for col in train_df.columns:
  if train_df[col].dtype == 'O':
    dummies = pd.get_dummies(train_df[col])
    train_df[dummies.columns] = dummies

# Drop the original object-type columns now that they are encoded
for col in train_df.columns:
  if train_df[col].dtype == 'O':
    train_df = train_df.drop(col, axis=1)

train_df.shape

Output:

(7000, 47)
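To see what pd.get_dummies does, here is a toy example (illustrative only; the category values are hypothetical, not taken from the dataset):

# Each category becomes its own indicator column
example = pd.get_dummies(pd.Series(['Single', 'Married', 'Single']))
print(example)
# Produces a 'Married' and a 'Single' column with one-hot values
# (0/1 in older pandas versions, True/False in recent ones)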

Separating the features from the target the model will learn to predict:

df = train_df.drop(['Attrition_rate'], axis=1)
target = train_df['Attrition_rate']

Data Normalization

Normalizing the data using MinMaxScaler

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
scaled_feat = pd.DataFrame(data=scaled, columns=df.columns)
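MinMaxScaler rescales each feature independently to the [0, 1] range using x' = (x - min) / (max - min). A tiny illustration:

# For the values 1, 3, 5: min = 1, max = 5, so they map to 0, 0.5, 1
toy = np.array([[1.0], [3.0], [5.0]])
print(MinMaxScaler().fit_transform(toy).ravel())  # [0.  0.5 1. ]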

Data Modelling

Importing required libraries

import xgboost as xgb
from xgboost.sklearn import XGBRegressor
from sklearn import metrics

Defining a function that uses cross-validation to pick the number of estimators, fits the model, and reports the training RMSE:

def modelfit(alg, train_df, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        # Use xgboost's built-in cross-validation to choose the number of trees,
        # stopping early once the CV RMSE stops improving
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(train_df[predictors].values, label=target.values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          metrics='rmse', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])
    # Fit the algorithm on the data (note: `target` comes from the enclosing scope)
    alg.fit(train_df[predictors], target, eval_metric='rmse')
    # Predict on the training set
    train_df_predictions = alg.predict(train_df[predictors])

    print("\nModel Report")
    print("RMSE :", metrics.mean_squared_error(target.values, train_df_predictions) ** 0.5)

predictors = list(df.columns)

xgb1 = XGBRegressor(booster='dart',           # DART: boosted trees with dropout regularization
                    learning_rate=0.1,
                    n_estimators=35,
                    max_depth=2,              # shallow trees to limit overfitting
                    min_child_weight=1,
                    gamma=1,
                    subsample=0.8,            # row subsampling per tree
                    colsample_bytree=0.8,     # column subsampling per tree
                    objective='reg:squarederror',
                    nthread=4,
                    scale_pos_weight=1,
                    seed=40)

modelfit(xgb1, scaled_feat, predictors)

Output:

Model Report 
RMSE : 0.18588951725259212
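To understand which inputs drive the predictions, you can inspect the fitted model's feature importances (an optional sketch using the scikit-learn wrapper's feature_importances_ attribute):

# Rank features by the importance scores learned during training
importances = pd.Series(xgb1.feature_importances_, index=predictors)
print(importances.sort_values(ascending=False).head(10))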

Summary

The root mean squared error obtained for the model is 0.1859, a reasonably low value. Keep in mind, however, that it is computed on the training data, so it is an optimistic estimate of accuracy. Further feature engineering can be done to improve the model's performance.
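For a more honest estimate, you could evaluate on a held-out split (a minimal sketch using scikit-learn's train_test_split; this refits the model on the training portion only):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for validation
X_tr, X_val, y_tr, y_val = train_test_split(scaled_feat, target, test_size=0.2, random_state=40)
xgb1.fit(X_tr, y_tr)
val_pred = xgb1.predict(X_val)
print("Validation RMSE:", metrics.mean_squared_error(y_val, val_pred) ** 0.5)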
