
Health Insurance premium prediction in Python using scikit-learn

By Kotha Sai Narasimha Rao

The main aim of this project is to predict the medical charges billed to each user by a health insurance company, in Python using scikit-learn.

Dataset link: https://www.kaggle.com/mirichoi0218/insurance

About the dataset:

age: age of the primary beneficiary

sex: insurance contractor gender, female, male

BMI: Body mass index, an objective index of body weight relative to height (kg / m ^ 2), computed as weight divided by height squared. It indicates whether weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9.

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes (yes / no)

region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance.
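The BMI definition in the column list above can be illustrated with a short calculation. This is a minimal sketch; the function name `bmi` and the example weight and height are hypothetical and are not taken from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Return the body mass index in kg / m^2 (weight divided by height squared)."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
example_bmi = bmi(70, 1.75)
print(round(example_bmi, 2))
```

A value between 18.5 and 24.9 falls in the range the dataset description calls ideal.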

 

Prerequisites :

 

Step 1 :

Import the necessary packages  

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Short Description about the important libraries :

->Pandas is used to create and manipulate the data frames.

->Numpy is used to create and manipulate the arrays

->Matplotlib is used for visualization purposes.

Data visualization and data preprocessing :

Step 1:

Import the dataset from the local folder and display the first rows.

data=pd.read_csv(r'D:\Codespeedy\Medical-Insurance-Prediction-master\insurance.csv')
data.head()

Step 2:

data.info()
Output :

Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

Step 3:
Checking the Null values in the dataset.
data.isna().sum()

Output :

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

The dataset does not contain any null values.

Step 4:
Plotting the classes that are present in the 'region' Column with respect to their frequency.
sns.catplot(x='region', kind="count", data= data);

Step 5:

Plotting the classes that are present in the 'sex' Column with respect to their frequency.
sns.catplot(x='sex', kind="count", data= data);

Step 6:

Plotting the classes that are present in the 'smoker' Column with respect to their frequency.

sns.catplot(x='smoker', kind="count", data= data);

Step 7:

Creating the numerical plot for the continuous variables that are present in the dataset.

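def num_plot(x,c='g'):
    plt.figure(figsize=(16,8))
    # sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot
    sns.histplot(data[x],color=c,kde=True)
    plt.show()
    # Print summary statistics for the plotted column
    print(10*'----',x,10*'----')
    print('MIN: ',data[x].min())
    print('MAX: ',data[x].max())
    print('MEAN: ',data[x].mean())
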

Step 8:

The most common user ages are between 18 and 22.

num_plot('age')

The minimum age is 18.

The maximum age is 64

The mean age is 39.20702541106

 

Step 9 :

num_plot('bmi')

This variable is approximately normally distributed.

The minimum BMI is 15.96.

The maximum BMI is 53.13.

The mean BMI is 30.66339686.

 

 

 

Step 10 :

num_plot('children')

The most common value is 0, that is, the largest group of users has no children.

The minimum number of children is 0.

The maximum number of children is 5.

The mean number of children is 1.0949177.

 

Step 11:

num_plot('charges')

All charges are well below 100000, and the distribution has a long tail towards the higher values.

The minimum charge is 1121.8739.

The maximum charge is 63770.42801.

The mean charge is 13270.422265.

 

Step 12 :
AGE VS CHARGES
plt.figure(figsize=(16,8))
# Put charges on the y-axis so the plot really shows age against charges
sns.scatterplot(y=data['charges'],x=data['age'],hue=data['charges'],palette='viridis')
plt.show()

Step 13 :

AGE VS BMI

plt.figure(figsize=(16,8))
sns.scatterplot(y=data['bmi'],x=data['age'],hue=data['charges'],palette='viridis')
plt.show()

 

Step 14 :

Plotting the BMI distribution of the users who are charged more than 50000.

plt.figure(figsize=(16,8))
data[data['charges']>50000]['bmi'].plot(kind='hist',color='m')
plt.show()

Step 15 :

Plotting the age distribution of the users who are charged more than 50000.

plt.figure(figsize=(16,8))
data[data['charges']>50000]['age'].plot(kind='hist',color='m')
plt.show()

Step 16:

Checking whether the mean insurance charge differs between smokers and non-smokers.

mean_smo, mean_non_smo = data['charges'][data['smoker'] == 'yes'].mean(), data['charges'][data['smoker'] == 'no'].mean()
mean_smo, mean_non_smo

Output :

(32050.23183153284, 8434.268297856204)

We can conclude from above that the mean insurance charge for smokers is almost four times that of non-smokers, which is consistent with the higher health risks associated with smoking.
 
Step 17 :
Checking for gender as well.
mean_male, mean_female = data['charges'][data['sex'] == 'male'].mean(), data['charges'][data['sex'] == 'female'].mean()
mean_male, mean_female

Output :

(13956.751177721893, 12569.578843835347)

We can conclude from above that the mean insurance charge for males is about 11% higher than for females, so the two groups are fairly close.
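The group means in Steps 16 and 17 can also be obtained in one call per column with `groupby`. The sketch below uses a small hypothetical frame (`toy`) with made-up values rather than the real dataset, so the numbers are only illustrative.

```python
import pandas as pd

# Hypothetical rows that mimic the dataset's columns; these are not real records.
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
    "charges": [30000.0, 8000.0, 9000.0, 34000.0],
})

# One mean per category, indexed by the category label.
mean_by_smoker = toy.groupby("smoker")["charges"].mean()
mean_by_sex = toy.groupby("sex")["charges"].mean()

print(mean_by_smoker)
print(mean_by_sex)
```

On the real data, replacing `toy` with `data` should give the same means as the boolean-mask expressions used in the steps above.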

Step 18 :

Check for BMI higher than the normal range - Assumed normal range as 18.5 to 25.

mean_bmi_large, mean_bmi_normal = data['charges'][data['bmi'] > 25].mean(), data['charges'][data['bmi'] <= 25].mean()
mean_bmi_large, mean_bmi_normal

Output :

(13946.476035324473, 10284.290025182185)

We can conclude from above that the mean insurance charge for people with a higher than normal BMI is about 36% higher, plausibly because they are more likely to suffer from diseases.
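Step 18 splits users at a single BMI threshold. As a hedged sketch of a finer split, `pd.cut` can assign each user to a band. The band labels, the hypothetical frame `toy` and its values are assumptions made for illustration; the bin edges follow the 18.5 to 25 range assumed above, with 25 itself counted as normal to match the `bmi > 25` test.

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from the dataset.
toy = pd.DataFrame({
    "bmi": [17.0, 22.0, 25.0, 30.0, 35.0],
    "charges": [5000.0, 9000.0, 11000.0, 13000.0, 15000.0],
})

# Right-inclusive bins: (0, 18.5], (18.5, 25], (25, inf)
toy["bmi_band"] = pd.cut(
    toy["bmi"],
    bins=[0, 18.5, 25, np.inf],
    labels=["below_normal", "normal", "above_normal"],
)

# Mean charge per band; observed=True keeps only bands that actually occur.
mean_by_band = toy.groupby("bmi_band", observed=True)["charges"].mean()
print(mean_by_band)
```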

 

Step 19 :

Check whether age affects the charges by comparing users younger than 35 with users aged 35 or older.

mean_young, mean_old = data['charges'][data['age'] < 35].mean(), data['charges'][data['age'] >= 35].mean()
mean_young, mean_old

Output :

(9673.316908395263, 15773.351087515843)

Users aged 35 or older are charged about 63% more on average than younger users.

Step 20 :
 
Finding the correlation between the variables by using a heat map.
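# numeric_only=True (pandas 1.5 or newer) skips the text columns sex, smoker and region
sns.heatmap(data.corr(numeric_only=True), annot = True, vmin=-1, vmax=1, center= 0)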
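A correlation matrix only covers numeric columns, so a text column such as smoker has to be turned into numbers before it can appear in the heat map. The following is a minimal sketch on a hypothetical frame (`toy`, with made-up values); the column name `smoker_flag` is an assumption introduced here.

```python
import pandas as pd

# Hypothetical rows, not real records.
toy = pd.DataFrame({
    "age": [19, 33, 45, 60],
    "smoker": ["no", "no", "yes", "yes"],
    "charges": [1500.0, 4000.0, 30000.0, 45000.0],
})

# Encode the text column as 0/1 so that it can take part in the correlation.
toy["smoker_flag"] = (toy["smoker"] == "yes").astype(int)

# Keep only the numeric columns, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()
print(corr)
```

The resulting `corr` frame can be passed to `sns.heatmap` in the same way as in Step 20.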

 

Feature Selection
 
The effect of the children and region columns on the output variable is very small, so we remove those columns from the dataset.
data = data.drop(['children','region'], axis = 1)
data.head()

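Encoding the categorical variables that are present in our dataset. pd.get_dummies returns one column per category, so assigning its result to a single column is ambiguous and can raise an error on recent pandas versions. The explicit comparisons below make 1 mean male and 1 mean smoker.

cat_var = ['sex','smoker']
data['sex'] = (data['sex'] == 'male').astype(int)
data['smoker'] = (data['smoker'] == 'yes').astype(int)

Note: the outputs reported in the rest of this article appear to come from the original get_dummies assignment, in which 1 marked the first category (female, non-smoker), so rerunning with this encoding will give different numbers.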
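To see why the encoding of a two-category column needs care, it helps to look at what `pd.get_dummies` returns. This is a small sketch on a hypothetical series; the variable names `smoker_flag` and `smoker_flag_alt` are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical values, not real records.
smoker = pd.Series(["yes", "no", "no", "yes"], name="smoker")

# get_dummies creates one indicator column per category, in sorted order.
dummies = pd.get_dummies(smoker)
print(list(dummies.columns))

# Option 1: an explicit comparison, so that 1 always means "yes".
smoker_flag = (smoker == "yes").astype(int)

# Option 2: drop the first category ("no") and keep the "yes" indicator.
smoker_flag_alt = pd.get_dummies(smoker, drop_first=True)["yes"].astype(int)

print(smoker_flag.tolist())
```

Because the first dummy column belongs to the alphabetically first category ("no"), taking it as the smoker flag would mark non-smokers with 1.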

Create a "High_Risk" column that is 1 when an individual is older than 35, is a smoker and has a BMI greater than 25, and 0 otherwise.

def coladd (age,smoker,bmi):
    if age>35 and smoker ==1 and bmi >25:
        return 1
    else:
        return 0

data['High_Risk'] = data[['age','smoker','bmi']].apply(lambda x: coladd(*x), axis=1)

 

data['High_Risk'].value_counts()

Output :

0    819
1    519
Name: High_Risk, dtype: int64

 
Check whether the mean insurance amount of these individuals is higher than that of the rest.
mean_high_risk, mean_not_high_risk = data['charges'][data['High_Risk'] == 1].mean(), data['charges'][data['High_Risk'] == 0].mean()
mean_high_risk, mean_not_high_risk

Output :

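(11226.504362427744, 14565.65229140293)

The reported output does not support the hypothesis: the flagged group has a lower mean charge (about 11227) than the rest (about 14566), not a higher one. This is the pattern expected if smoker == 1 marked non-smokers when the output was produced, so the encoding of the smoker column should be verified before relying on this feature.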

 
Let's build one more feature by ignoring the age and considering only BMI and smoker.
def coladd1 (smoker,bmi):
    if smoker ==1 and bmi >25:
        return 1
    else:
        return 0

data['Medium_Risk'] = data[['smoker','bmi']].apply(lambda x: coladd1(*x), axis=1)
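The two risk flags can also be built without a row-wise `apply`, by combining boolean masks. The sketch below runs on a hypothetical frame (`toy`, with made-up values) and assumes that smoker == 1 marks a smoker.

```python
import pandas as pd

# Hypothetical rows; smoker is assumed to be encoded as 1 = smoker, 0 = non-smoker.
toy = pd.DataFrame({
    "age": [50, 50, 30, 40],
    "smoker": [1, 0, 1, 1],
    "bmi": [31.0, 31.0, 28.0, 22.0],
})

is_smoker = toy["smoker"] == 1
high_bmi = toy["bmi"] > 25
older = toy["age"] > 35

# Same rules as coladd1 and coladd, evaluated for all rows at once.
toy["Medium_Risk"] = (is_smoker & high_bmi).astype(int)
toy["High_Risk"] = (is_smoker & high_bmi & older).astype(int)

print(toy)
```

The masks make each condition visible by name, which also makes it easier to check that the smoker flag has the intended meaning.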

 

Checking the same analysis for medium risk individuals

mean_med_risk, mean_not_med_risk = data['charges'][data['Medium_Risk'] == 1].mean(), data['charges'][data['Medium_Risk'] == 0].mean()
mean_med_risk, mean_not_med_risk

Output :

(8629.589609712157, 21954.55547444206)

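The reported output again contradicts the hypothesis: the flagged group has a mean charge of about 8630, roughly 2.5 times lower than the mean of about 21955 for the rest. As with the high-risk flag, this points to smoker == 1 marking non-smokers when the output was produced.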

Dividing the dataset into two parts, X and Y.
X = data[['age', 'sex', 'bmi', 'smoker','High_Risk', 'Medium_Risk']]
Y = data['charges']

X - independent variables (the features).

Y - dependent variable (the target, charges).

 

Splitting the dataset for training and testing the model.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state = 100)
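As a quick illustration of what `test_size=0.3` and `random_state` do, the sketch below splits a hypothetical ten-row array (`X_toy`, `y_toy`) instead of the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# 30% of the rows go to the test set; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100
)

print(X_tr.shape, X_te.shape)
```

Calling the function again with the same `random_state` returns the same rows in each part, which keeps the reported metrics reproducible.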

 

 

Model Building

Linear regression :

Training the Model by using the training dataset.

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)

 

Predicting the results.

y_pred_linear = regressor.predict(X_test)

Calculating the mean squared error and the R² score (printed as "Variance score").

print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred_linear))
print('Variance score: %.2f' % r2_score(y_test, y_pred_linear))

Output :

Mean squared error: 30637726.75
Variance score: 0.79
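To make the two metrics concrete, the sketch below computes them by hand with NumPy on hypothetical values (`y_true` and `y_hat` are made up for illustration). The mean squared error is the average squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)                    # mean squared error
rmse = np.sqrt(mse)                              # same units as the target
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(mse, rmse, r2)
```

Taking the square root makes the error easier to read: the reported mean squared error of 30637726.75 corresponds to a root mean squared error of about 5535, in the same units as charges.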

Visualizing the results using scatter plot.
plt.scatter(y_pred_linear, y_test, color='red')

 

Model Building using the lasso regression.

Train the lasso regression model using the training set.

from sklearn.linear_model import LassoCV
lasso_eps = 0.0001
lasso_nalpha=20
lasso_iter=10000
# normalize=True is only accepted by scikit-learn versions before 1.2; on newer versions remove it and scale the features separately
model_lasso= LassoCV(eps=lasso_eps,n_alphas=lasso_nalpha,max_iter=lasso_iter, normalize=True,cv=5)
model_lasso.fit(X_train,y_train)

Output :

LassoCV(cv=5, eps=0.0001, max_iter=10000, n_alphas=20, normalize=True)
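On scikit-learn versions where `LassoCV` no longer accepts `normalize`, one possible replacement is to scale the features inside a pipeline. The sketch below is an assumption-laden illustration, not the setup that produced the outputs in this article: it uses synthetic data (`X_toy`, `y_toy`) and `StandardScaler`, which divides by the standard deviation and is therefore not identical to the old `normalize=True` behaviour.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends only on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2)) * [10.0, 1.0]
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=100)

# Scale the features first, then let LassoCV choose alpha by 5-fold cross-validation.
pipe = make_pipeline(
    StandardScaler(),
    LassoCV(eps=0.0001, max_iter=10000, cv=5),
)
pipe.fit(X_toy, y_toy)

lasso = pipe.named_steps["lassocv"]
print(lasso.alpha_, lasso.coef_)
print(pipe.score(X_toy, y_toy))
```

Keeping the scaler in the pipeline means the same scaling is applied automatically when `pipe.predict` is called on new data.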

To get the variable importance, print the fitted coefficients and the intercept.
print(list(zip(model_lasso.coef_,X_train.columns)))
print(model_lasso.intercept_)

Output :

[(263.9842580546879, 'age'), (-0.0, 'sex'), (425.4205810355094, 'bmi'), (-20037.03930661138, 'smoker'), (-0.0, 'High_Risk'), (-4617.541287287641, 'Medium_Risk')]
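8903.607745718311

Lasso shrinks the sex and High_Risk coefficients to zero, so they do not contribute to the predictions. The smoker coefficient is negative, which is consistent with smoker == 1 marking non-smokers when this output was produced.

Predicting the test set results.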
y_predited_lasso = model_lasso.predict(X_test)

Visualizing the results obtained from the lasso regression.

plt.scatter(y_predited_lasso,y_test)
plt.xlabel("Predicted Insurance")
plt.ylabel("Actual Insurance")

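Linear regression reaches an R² of 0.79 on the test set, and the lasso predictions follow the actual charges in the scatter plot. Either model could be taken forward, but the lasso model should be scored with the same metrics before one of them is chosen for deployment.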