Coders Packet

Health Insurance premium prediction in Python using scikit-learn

By Kotha Sai Narasimha Rao

The aim of this project is to predict the insurance charges billed to each user by a health insurance company, using Python and scikit-learn.

Dataset link:

About the dataset:

age: age of the primary beneficiary

sex: insurance contractor gender, female, male

BMI: body mass index, an objective measure of body weight relative to height (kg / m ^ 2) that indicates whether weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes (yes, no)

region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance.


Prerequisites :


Step -1 :

Import the necessary packages  

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Short Description about the important libraries :

->Pandas is used to create and manipulate the data frames.

->NumPy is used to create and manipulate arrays.

->Matplotlib is used for visualization purposes.

Data visualization and data preprocessing :

Step 1:

Import the dataset from the local folder.
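A minimal sketch of this step, assuming the data is a CSV file with the seven columns described above. The file name insurance.csv mentioned in the comment and the sample rows are placeholders, not taken from the original code; an in-memory buffer stands in for the file so the snippet runs on its own.

```python
import io
import pandas as pd

# Illustrative rows only (made up for this sketch); the real dataset has 1338 rows.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "25,female,22.5,0,no,northeast,3000.0\n"
    "40,male,31.2,2,yes,southeast,35000.0\n"
    "58,female,27.8,1,no,southwest,12000.0\n"
)

# With a real file this would be: data = pd.read_csv("insurance.csv")  (hypothetical path)
data = pd.read_csv(sample_csv)
print(data.head())
```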


Step 2:
Output :

Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

Step 3:
Checking the Null values in the dataset.

Output :

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
The dataset does not contain null values.
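Outputs of the kind shown in Steps 2 and 3 are what data.info() and data.isnull().sum() print. The calls themselves are not shown in the text, so treating them as the ones used is an assumption. A small sketch on made-up values, with one value deliberately missing so the count is visible:

```python
import pandas as pd

# Small illustrative frame with one deliberately missing value.
data = pd.DataFrame({
    "age": [25, 40, 58],
    "bmi": [22.5, None, 27.8],
    "charges": [3000.0, 35000.0, 12000.0],
})

data.info()                        # column names, non-null counts and dtypes
null_counts = data.isnull().sum()  # number of missing values per column
print(null_counts)
```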

Step 4:
Plotting the frequency of each class in the 'region' column.
sns.catplot(x='region', kind='count', data=data)

Step 5:

Plotting the frequency of each class in the 'sex' column.
sns.catplot(x='sex', kind='count', data=data)

Step 6:

Plotting the frequency of each class in the 'smoker' column.

sns.catplot(x='smoker', kind='count', data=data)

Step 7:

Creating the numerical plot for the continuous variables that are present in the dataset.

def num_plot(x,c='g'):
    print('MIN: ',data[x].min())
    print('MAX: ',data[x].max())
    print('MEAN: ',data[x].mean())
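The function as shown only prints summary statistics; the plotting call and the use of the colour argument c are not included in the text. One possible completion, assuming a plain matplotlib histogram is acceptable (the original may have used a different plot type), is sketched below on made-up values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({"age": [18, 19, 22, 35, 47, 64]})  # illustrative values only

def num_plot(x, c="g"):
    """Print min, max and mean of column x and draw its histogram in colour c."""
    stats = {"min": data[x].min(), "max": data[x].max(), "mean": data[x].mean()}
    for name, value in stats.items():
        print(name.upper() + ": ", value)
    fig, ax = plt.subplots()
    ax.hist(data[x], bins=10, color=c)
    ax.set_xlabel(x)
    ax.set_ylabel("count")
    return stats

summary = num_plot("age")
```

Returning the statistics as a dictionary is a convenience added here so the values can be reused; it is not part of the original function.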

Step 8:

The largest group of users is between 18 and 22 years old.


The minimum age is 18.

The maximum age is 64

The mean age is 39.20702541106


Step 9 :


BMI is approximately normally distributed.

The minimum BMI is 15.96.

The maximum BMI is 53.13.

The mean BMI is 30.66339686.




Step 10 :


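The largest group of users has no children. The dataset has no marital-status column, so this does not tell us whether users are married.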

The minimum no of children is 0.

The maximum no of children is 5.

The mean no of children is 1.0949177.


Step 11:


All of the charges are below 100000.

The minimum charge is 1121.8739.

The maximum charge is 63770.42801.

The mean charge is 13270.422265.


Step 12 :

Step 13 :




Step 14 :

Plotting the BMI of users who pay more than 50000 in charges.


Step 15 :

Plotting the age of users who pay more than 50000 in charges.


Step 16:

Checking whether there is a relationship between smoking and the mean insurance charge.

mean_smo, mean_non_smo = data['charges'][data['smoker'] == 'yes'].mean(), data['charges'][data['smoker'] == 'no'].mean()
mean_smo, mean_non_smo

Output :

(32050.23183153284, 8434.268297856204)

The mean insurance charge for smokers (about 32050) is almost four times that for non-smokers (about 8434), plausibly because smoking is associated with higher health risks and therefore higher medical costs.
Step 17 :
Checking for gender as well.
mean_male, mean_female = data['charges'][data['sex'] == 'male'].mean(), data['charges'][data['sex'] == 'female'].mean()
mean_male, mean_female

Output :

(13956.751177721893, 12569.578843835347)

The mean insurance charge for males (about 13957) is roughly 11% higher than for females (about 12570), a much smaller difference than the one between smokers and non-smokers.

Step 18 :

Check for BMI higher than the normal range - Assumed normal range as 18.5 to 25.

mean_bmi_large, mean_bmi_normal = data['charges'][data['bmi'] > 25].mean(), data['charges'][data['bmi'] <= 25].mean()
mean_bmi_large, mean_bmi_normal

Output :

(13946.476035324473, 10284.290025182185)

The mean insurance charge for people with a BMI above the normal range (about 13946) is roughly 36% higher than for the rest (about 10284), plausibly because a high BMI is associated with a higher risk of disease.


Step 19 :

Check whether age affects the charges, splitting users at age 35.

mean_young, mean_old = data['charges'][data['age'] < 35].mean(), data['charges'][data['age'] >= 35].mean()
mean_young, mean_old

Output :

(9673.316908395263, 15773.351087515843)
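The group comparisons in Steps 16 to 19 can also be written with groupby, which computes the mean for every group in one call. This is an alternative formulation rather than the code used above, and the rows below are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "age": [22, 30, 41, 57],
    "smoker": ["no", "yes", "no", "yes"],
    "charges": [2000.0, 20000.0, 8000.0, 40000.0],
})

# Mean charge per category of an existing column.
by_smoker = data.groupby("smoker")["charges"].mean()

# Mean charge per derived group: label each row first, then group by the labels.
age_group = (data["age"] >= 35).map({True: "35 and over", False: "under 35"})
by_age = data.groupby(age_group)["charges"].mean()

print(by_smoker)
print(by_age)
```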

Step 20 :
Finding the correlation between the variables by using a heat map.
# numeric_only=True skips any columns that are still text (sex, smoker, region)
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, center=0)


Feature Selection
The children and region columns have very little effect on the output variable, so we remove them from the dataset.
data = data.drop(['children','region'], axis = 1)

Encoding the categorical variables that are present in our dataset.

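cat_var = ['sex','smoker']
# pd.get_dummies returns one column per category, so its result cannot be
# assigned to a single column. Map each category to 0/1 explicitly instead.
data['sex'] = (data['sex'] == 'male').astype(int)       # 1 = male, 0 = female
data['smoker'] = (data['smoker'] == 'yes').astype(int)  # 1 = smoker, 0 = non-smoker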
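To illustrate why the encoding matters: pd.get_dummies returns one indicator column per category, ordered alphabetically, so for smoker the first column is 'no'. The sketch below, on made-up values, contrasts that with an explicit 0/1 mapping in which 1 always means smoker.

```python
import pandas as pd

smoker = pd.Series(["yes", "no", "no", "yes"], name="smoker")  # illustrative values

# get_dummies creates one indicator column per category, in sorted order.
dummies = pd.get_dummies(smoker)
print(list(dummies.columns))   # ['no', 'yes']

# An explicit comparison makes the meaning of 1 unambiguous: 1 = smoker.
smoker_flag = (smoker == "yes").astype(int)
print(smoker_flag.tolist())    # [1, 0, 0, 1]
```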

Create a "High risk" column that flags individuals who are older than 35, are smokers, and have a BMI greater than 25.

def coladd(age, smoker, bmi):
    # 1 if older than 35, a smoker (smoker == 1) and BMI above 25, otherwise 0
    if age > 35 and smoker == 1 and bmi > 25:
        return 1
    return 0

data['High_Risk'] = data[['age','smoker','bmi']].apply(lambda x: coladd(*x), axis=1)



Output :

0    819
1    519
Name: High_Risk, dtype: int64
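The same flag can be built without a row-wise apply by combining the three conditions column-wise, and value_counts then gives a count table like the one above. This is a sketch on made-up rows, and it assumes smoker is encoded as 1 = smoker and 0 = non-smoker.

```python
import pandas as pd

# Illustrative rows; smoker is assumed to be encoded as 1 = smoker, 0 = non-smoker.
data = pd.DataFrame({
    "age": [22, 40, 58, 61],
    "smoker": [1, 1, 0, 1],
    "bmi": [31.0, 29.5, 33.1, 24.0],
})

# Combine the three conditions column-wise instead of calling a function per row.
high_risk = (data["age"] > 35) & (data["smoker"] == 1) & (data["bmi"] > 25)
data["High_Risk"] = high_risk.astype(int)

counts = data["High_Risk"].value_counts()
print(counts)
```

Only the second row satisfies all three conditions, so the flag is 1 for that row and 0 for the others.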

Check whether the mean insurance charge of these individuals is higher than that of the rest.
mean_high_risk, mean_not_high_risk = data['charges'][data['High_Risk'] == 1].mean(), data['charges'][data['High_Risk'] == 0].mean()
mean_high_risk, mean_not_high_risk

Output :

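(11226.504362427744, 14565.65229140293)

The output does not support the hypothesis: the flagged group has a lower mean charge (about 11227) than the rest (about 14566). Together with the negative lasso coefficient for smoker reported later, this suggests that in the run that produced these numbers the smoker column was encoded with 1 meaning non-smoker, so the flag marked older, higher-BMI non-smokers rather than smokers.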

Let's build one more flag by ignoring age and considering only BMI and smoker.
def coladd1(smoker, bmi):
    # 1 if a smoker (smoker == 1) with BMI above 25, otherwise 0
    if smoker == 1 and bmi > 25:
        return 1
    return 0

data['Medium_Risk'] = data[['smoker','bmi']].apply(lambda x: coladd1(*x), axis=1)


Checking the same analysis for medium risk individuals

mean_med_risk, mean_not_med_risk = data['charges'][data['Medium_Risk'] == 1].mean(), data['charges'][data['Medium_Risk'] == 0].mean()
mean_med_risk, mean_not_med_risk

Output :

(8629.589609712157, 21954.55547444206)

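The output does not support the hypothesis: the flagged group has a lower mean charge (about 8630) than the rest (about 21955). This is consistent with the smoker column having been encoded with 1 meaning non-smoker in the run that produced these numbers, in which case the flag marked non-smokers with a BMI above 25.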

Dividing the dataset into two parts, X and Y.
X = data[['age', 'sex', 'bmi', 'smoker','High_Risk', 'Medium_Risk']]
Y = data['charges']

X - independent variables (the features).

Y - dependent variable (the target, charges).


Splitting the dataset for training and testing the model.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state = 100)



Model Building

Linear regression :

Training the Model by using the training dataset.

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)


Predicting the results.

y_pred_linear = regressor.predict(X_test)

Calculating the mean squared error and the R² score (printed as 'Variance score').

print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred_linear))
print('Variance score: %.2f' % r2_score(y_test, y_pred_linear))

Output :

Mean squared error: 30637726.75
Variance score: 0.79
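The mean squared error is in squared units of charges, which makes it hard to read. Taking the square root gives the root mean squared error (RMSE) in the same units as the charges. The sketch below uses only the numbers reported above; comparing the RMSE with the mean charge from Step 11 is a rough guide only, since that mean covers the whole dataset rather than the test set.

```python
import math

mse = 30637726.75            # mean squared error reported above
rmse = math.sqrt(mse)        # same units as charges
mean_charge = 13270.422265   # mean charge reported in Step 11 (whole dataset)

print(f"RMSE: {rmse:.2f}")
print(f"RMSE as a share of the mean charge: {rmse / mean_charge:.0%}")
```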

Visualizing the results using scatter plot.
plt.scatter(y_pred_linear, y_test, color='red')


Model Building using the lasso regression.

Train the lasso regression model using the training set.

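from sklearn.linear_model import LassoCV
lasso_eps = 0.0001
lasso_nalpha = 20    # values taken from the fitted model printed below
lasso_iter = 10000
# normalize=True is only accepted by scikit-learn versions before 1.2
model_lasso = LassoCV(eps=lasso_eps, n_alphas=lasso_nalpha, max_iter=lasso_iter, normalize=True, cv=5).fit(X_train, y_train)

Output :

LassoCV(cv=5, eps=0.0001, max_iter=10000, n_alphas=20, normalize=True)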

To get the variable importance

Output :

[(263.9842580546879, 'age'), (-0.0, 'sex'), (425.4205810355094, 'bmi'), (-20037.03930661138, 'smoker'), (-0.0, 'High_Risk'), (-4617.541287287641, 'Medium_Risk')]
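The code that produced this list is not shown. One way to get a list of (coefficient, column name) pairs in this shape is to zip the fitted model's coef_ attribute with the feature names. The sketch below does this on synthetic data with default LassoCV settings, so the variable names and values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

# Synthetic data for illustration: y depends on x1 and x2, but not on noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "noise"])
y = 3.0 * X["x1"] - 2.0 * X["x2"] + rng.normal(scale=0.1, size=200)

model = LassoCV(cv=5).fit(X, y)

# Pair each fitted coefficient with its column name.
importance = list(zip(model.coef_, X.columns))
print(importance)
```

Coefficient size reflects importance only when the features are on comparable scales, and lasso tends to shrink the coefficients of uninformative features to zero.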
Predicting the test set results.
y_predited_lasso = model_lasso.predict(X_test)

Visualizing the results obtained from the lasso regression.

plt.xlabel("Predicted Insurance")
plt.ylabel("Actual Insurance")

Both models perform well on the test dataset, so either one can be chosen for deployment.
