Coders Packet

Health Insurance premium prediction in Python using scikit-learn

By Kotha Sai Narasimha Rao

The main aim of this project is to predict the medical charges billed to each user by a health insurance company, in Python using scikit-learn.

Dataset link:

About the dataset:

age: age of the primary beneficiary

sex: insurance contractor gender, female, male

bmi: Body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m^2). It indicates whether weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9.

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes, yes or no

region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance.
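
As a small worked example of the bmi column described above, here is a sketch of the formula (weight in kilograms divided by height in metres squared). The function name and the sample weight and height are made up for illustration.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg, 1.75 m tall
example = bmi(70.0, 1.75)
print(round(example, 1))  # 22.9, inside the 18.5 to 24.9 range
```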


Prerequisites :


Step -1 :

Import the necessary packages  

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Short Description about the important libraries :

->Pandas is used to create and manipulate data frames.

->NumPy is used to create and manipulate arrays.

->Matplotlib is used for visualization purposes.

Data visualization and data preprocessing :

Step 1:

Import the dataset from the local folder.
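
The loading code is not shown for this step. A minimal sketch, assuming the data is a CSV file with the seven columns described earlier: the file name insurance.csv is hypothetical, and so that the example runs without the file it reads a three-row stand-in with invented values from memory.

```python
import io

import pandas as pd

# Stand-in for the real file; these three rows are invented for illustration.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "25,female,24.0,0,no,northeast,3000.0\n"
    "40,male,31.5,2,yes,southeast,39000.0\n"
    "58,female,28.2,1,no,northwest,12000.0\n"
)

# With the real dataset this would be: data = pd.read_csv("insurance.csv")
data = pd.read_csv(sample_csv)

print(data.shape)   # (3, 7): rows, columns
print(data.head())  # first few rows
```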


Step 2:

Inspecting the columns and their data types.
data.info()

Output :

Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

Step 3:
Checking for null values in the dataset.
data.isnull().sum()

Output :

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

The dataset does not contain any null values.

Step 4:
Plotting the classes that are present in the 'region' column with respect to their frequency.
sns.catplot(x='region', kind='count', data=data)

Step 5:

Plotting the classes that are present in the 'sex' column with respect to their frequency.
sns.catplot(x='sex', kind='count', data=data)

Step 6:

Plotting the classes that are present in the 'smoker' column with respect to their frequency.

sns.catplot(x='smoker', kind='count', data=data)

Step 7:

Creating the numerical plot for the continuous variables that are present in the dataset.

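def num_plot(x, c='g'):
    # Print the minimum, maximum and mean of column x
    # (c is a plot colour; it is unused in the lines shown here)
    print('MIN: ', data[x].min())
    print('MAX: ', data[x].max())
    print('MEAN: ', data[x].mean())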
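
The function above only prints summary statistics. A possible extension that also draws a histogram is sketched below; the name num_plot_hist, the bin count and the stand-in data frame are assumptions made for this example, not part of the original project.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in values; the real data frame comes from the dataset.
data = pd.DataFrame({"age": [18, 19, 22, 35, 47, 64]})


def num_plot_hist(x, c="g"):
    """Print min, max and mean of column x and draw its histogram in colour c."""
    summary = {
        "min": data[x].min(),
        "max": data[x].max(),
        "mean": data[x].mean(),
    }
    for name, value in summary.items():
        print(name.upper() + ":", value)

    fig, ax = plt.subplots()
    ax.hist(data[x], bins=10, color=c)
    ax.set_xlabel(x)
    ax.set_ylabel("count")
    return summary


stats = num_plot_hist("age")
```

In a notebook the figure is displayed automatically; in a script, call plt.show() afterwards.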

Step 8:

The largest group of users is between 18 and 22 years old.


The minimum age is 18.

The maximum age is 64

The mean age is 39.20702541106


Step 9 :


This variable (BMI) is approximately normally distributed.

The minimum BMI is 15.96.

The maximum BMI is 53.13.

The mean BMI is 30.66339686.




Step 10 :


The most common number of children is 0.

The minimum number of children is 0.

The maximum number of children is 5.

The mean number of children is 1.0949177.


Step 11:


Most of the charges are below 10000.

The minimum charge is 1121.8739.

The maximum charge is 63770.42801.

The mean charge is 13270.422265.


Step 12 :

Step 13 :




Step 14 :

Plotting the BMI of users whose charges are greater than 50000.


Step 15 :

Plotting the age of users whose charges are greater than 50000.
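
The plotting code for Steps 14 and 15 is not shown. One way to do it is to filter the rows first and then plot the two columns, as in this sketch. The stand-in data frame holds invented values, and the choice of histograms is an assumption about what the plots looked like.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in rows; the real data frame comes from the dataset.
data = pd.DataFrame({
    "age": [23, 45, 52, 31, 60],
    "bmi": [24.1, 36.4, 41.0, 29.3, 33.8],
    "charges": [2500.0, 51000.0, 60500.0, 9800.0, 52300.0],
})

# Keep only the users whose charges exceed 50000
high_charges = data[data["charges"] > 50000]

# One histogram per column of interest
fig, (ax_bmi, ax_age) = plt.subplots(1, 2, figsize=(8, 3))
ax_bmi.hist(high_charges["bmi"], color="g")
ax_bmi.set_xlabel("bmi")
ax_age.hist(high_charges["age"], color="g")
ax_age.set_xlabel("age")
```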


STEP 16:

Checking if there is any relationship between smoking and the mean insurance charge.

mean_smo, mean_non_smo = data['charges'][data['smoker'] == 'yes'].mean(), data['charges'][data['smoker'] == 'no'].mean()
mean_smo, mean_non_smo

Output :

(32050.23183153284, 8434.268297856204)

We can conclude from the above that the mean insurance charge for smokers is almost four times that of non-smokers, which is consistent with the higher health risks associated with smoking.

Step 17 :

Checking for gender as well.
mean_male, mean_female = data['charges'][data['sex'] == 'male'].mean(), data['charges'][data['sex'] == 'female'].mean()
mean_male, mean_female

Output :

(13956.751177721893, 12569.578843835347)

We can conclude from the above that the mean insurance charge for males is only about 11% higher than for females, so gender has a small effect.

Step 18 :

Check for BMI higher than the normal range - Assumed normal range as 18.5 to 25.

mean_bmi_large, mean_bmi_normal = data['charges'][data['bmi'] > 25].mean(), data['charges'][data['bmi'] <= 25].mean()
mean_bmi_large, mean_bmi_normal

Output :

(13946.476035324473, 10284.290025182185)

We can conclude from the above that the mean insurance charge for people with a higher-than-normal BMI is greater (by about 36%), as they are more likely to suffer from diseases.


Step 19 :

Check for age - comparing users younger than 35 with users aged 35 and above.

mean_young, mean_old = data['charges'][data['age'] < 35].mean(), data['charges'][data['age'] >= 35].mean()
mean_young, mean_old

Output :

(9673.316908395263, 15773.351087515843)
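
The group means in Steps 16 to 19 are computed with boolean masks. The same comparisons can also be written with groupby, which scales better when a column has more than two classes. This is a sketch on an invented stand-in data frame, not the author's code.

```python
import pandas as pd

# Invented stand-in rows; the real data frame comes from the dataset.
data = pd.DataFrame({
    "age": [20, 40, 50, 30],
    "smoker": ["yes", "no", "yes", "no"],
    "charges": [30000.0, 8000.0, 34000.0, 9000.0],
})

# Mean charge per class of a categorical column
mean_by_smoker = data.groupby("smoker")["charges"].mean()
print(mean_by_smoker)

# Mean charge per side of a numeric threshold (False: under 35, True: 35 and above)
mean_by_age_group = data.groupby(data["age"] >= 35)["charges"].mean()
print(mean_by_age_group)
```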

Step 20 :
Finding the correlation between the variables by using a heat map.
# numeric_only=True (pandas 1.5+) skips the text columns sex, smoker and region
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, center=0)


Feature Selection
The effect of the children and region columns on the output variable is very small, so we remove those columns from the dataset.
data = data.drop(['children','region'], axis = 1)

Encoding the categorical variables that are present in our dataset.

cat_var = ['sex', 'smoker']
# Keep one 0/1 indicator per column: sex is 1 for male, smoker is 1 for yes
data['sex'] = pd.get_dummies(data['sex'])['male'].astype(int)
data['smoker'] = pd.get_dummies(data['smoker'])['yes'].astype(int)

Create a "High_Risk" column that is 1 when an individual is older than 35, is a smoker, and has a BMI greater than 25, and 0 otherwise.

def coladd(age, smoker, bmi):
    if age > 35 and smoker == 1 and bmi > 25:
        return 1
    return 0

data['High_Risk'] = data[['age','smoker','bmi']].apply(lambda x: coladd(*x), axis=1)
data['High_Risk'].value_counts()



Output :

0    819
1    519
Name: High_Risk, dtype: int64

Check whether the mean insurance amount of these individuals is higher than that of the rest.
mean_high_risk, mean_not_high_risk = data['charges'][data['High_Risk'] == 1].mean(), data['charges'][data['High_Risk'] == 0].mean()
mean_high_risk, mean_not_high_risk

Output :

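(11226.504362427744, 14565.65229140293)

The output above does not support the high-risk hypothesis: the flagged group averages about 11,227, which is lower than the 14,566 average of the rest. That is the pattern to expect if smoker was 1 for non-smokers when the flag was built, which can happen because pd.get_dummies orders its columns alphabetically ('no' before 'yes'). Confirm that smoker = 1 marks smokers before drawing conclusions from this flag.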

Let's build one more flag by ignoring age and considering only BMI and smoker.
def coladd1(smoker, bmi):
    if smoker == 1 and bmi > 25:
        return 1
    return 0

data['Medium_Risk'] = data[['smoker','bmi']].apply(lambda x: coladd1(*x), axis=1)
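
The two flag functions above are applied row by row. As an alternative, the same rules can be written as vectorized boolean masks, which avoids the Python-level loop inside apply. This sketch assumes smoker is encoded as 1 for smokers and 0 for non-smokers, and uses an invented four-row data frame.

```python
import pandas as pd

# Invented stand-in rows; smoker is assumed to be 1 for smokers, 0 otherwise.
data = pd.DataFrame({
    "age": [40, 40, 30, 50],
    "smoker": [1, 0, 1, 1],
    "bmi": [30.0, 30.0, 30.0, 22.0],
})

# Older than 35, smoker, and BMI above 25
high_risk = ((data["age"] > 35) & (data["smoker"] == 1) & (data["bmi"] > 25)).astype(int)

# Smoker and BMI above 25, regardless of age
medium_risk = ((data["smoker"] == 1) & (data["bmi"] > 25)).astype(int)

print(high_risk.tolist())    # [1, 0, 0, 0]
print(medium_risk.tolist())  # [1, 0, 1, 0]
```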


Checking the same analysis for medium-risk individuals.

mean_med_risk, mean_not_med_risk = data['charges'][data['Medium_Risk'] == 1].mean(), data['charges'][data['Medium_Risk'] == 0].mean()
mean_med_risk, mean_not_med_risk

Output :

(8629.589609712157, 21954.55547444206)

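Here too the output contradicts the expectation: the flagged group averages about 8,630, while the rest average about 21,955, roughly 2.5 times as much. This again points to smoker = 1 marking non-smokers when the flag was built, so the encoding should be checked before relying on this flag.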

Dividing the dataset into two parts, X and Y.
X = data[['age', 'sex', 'bmi', 'smoker','High_Risk', 'Medium_Risk']]
Y = data['charges']

X - independent variables (the features).

Y - dependent variable (the target).


Splitting the dataset for training and testing the model.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state = 100)
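
To see what the split above produces in terms of row counts, here is a sketch using placeholder arrays with the same number of rows as the dataset (1338). The array contents are arbitrary; only the shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 1338 rows, matching the size of the dataset
X_demo = np.arange(1338 * 2).reshape(1338, 2)
y_demo = np.arange(1338)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

# test_size=0.3 rounds the test set up to 402 rows, leaving 936 for training
print(X_tr.shape, X_te.shape)  # (936, 2) (402, 2)
```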



Model Building

Linear regression :

Training the model using the training dataset.

from sklearn.linear_model import LinearRegression
regressor = LinearRegression().fit(X_train, y_train)


Predicting the results.

y_pred_linear = regressor.predict(X_test)

Calculating the mean squared error and the R-squared score (printed below as the variance score).

print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred_linear))
print('Variance score: %.2f' % r2_score(y_test, y_pred_linear))

Output :

Mean squared error: 30637726.75
Variance score: 0.79
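
The mean squared error is in squared units of the charges, which makes it hard to read directly. Taking its square root gives the root mean squared error (RMSE), which is in the same units as the charges. The sketch below does this arithmetic on the value reported above. The comparison with the mean charge from Step 11 is only a rough yardstick, since that mean was computed over the whole dataset rather than the test set.

```python
import math

mse = 30637726.75           # mean squared error reported above
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.2f}")  # RMSE: 5535.14

mean_charge = 13270.422265  # mean charge reported in Step 11 (whole dataset)
print(f"RMSE / mean charge: {rmse / mean_charge:.2f}")  # 0.42
```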

Visualizing the results using a scatter plot.
plt.scatter(y_pred_linear, y_test, color='red')


Model Building using the lasso regression.

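Train the lasso regression model using the training set.

from sklearn.linear_model import LassoCV
lasso_eps = 0.0001
lasso_nalpha = 20
lasso_iter = 10000
# normalize=True was removed in scikit-learn 1.2; drop it on newer versions
model_lasso = LassoCV(eps=lasso_eps, n_alphas=lasso_nalpha, max_iter=lasso_iter, normalize=True, cv=5).fit(X_train, y_train)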
LassoCV(cv=5, eps=0.0001, max_iter=10000, n_alphas=20, normalize=True)

Getting the variable importance (the fitted coefficient for each feature).

Output :

[(263.9842580546879, 'age'), (-0.0, 'sex'), (425.4205810355094, 'bmi'), (-20037.03930661138, 'smoker'), (-0.0, 'High_Risk'), (-4617.541287287641, 'Medium_Risk')]
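
The code that produced the list above is not shown. One way to build a list of (coefficient, feature name) pairs in that shape is to zip the fitted model's coef_ attribute with the column names, as in this sketch. The data is synthetic and the column names and LassoCV settings are chosen for the example, so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

# Synthetic stand-in features; "noise" has no effect on the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "bmi": rng.normal(30, 6, size=200),
    "noise": rng.normal(0, 1, size=200),
})
y_demo = 250 * X_demo["age"] + 400 * X_demo["bmi"] + rng.normal(0, 500, size=200)

model = LassoCV(cv=5, max_iter=10000).fit(X_demo, y_demo)

# Pair each fitted coefficient with its column name
coef_pairs = list(zip(model.coef_, X_demo.columns))
print(coef_pairs)
```

A coefficient of exactly zero means the lasso penalty removed that feature from the model.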

Predicting the test set results.
y_predicted_lasso = model_lasso.predict(X_test)

Visualizing the results obtained from the lasso regression.

plt.scatter(y_predicted_lasso, y_test, color='red')
plt.xlabel("Predicted Insurance")
plt.ylabel("Actual Insurance")

Both models perform well on the test dataset, so either one could be chosen for deployment.
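
A comparison like this is easier to justify when both models are scored with the same metrics on the same test set. The following is a sketch of such a side-by-side check on synthetic data. The feature matrix, true coefficients and model settings are assumptions made for the example, not results from the insurance dataset.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem with one irrelevant feature (coefficient 0.0)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

models = {"linear": LinearRegression(), "lasso": LassoCV(cv=5, max_iter=10000)}
scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = (mean_squared_error(y_te, pred), r2_score(y_te, pred))
    print(f"{name}: MSE={scores[name][0]:.3f}  R2={scores[name][1]:.3f}")
```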