Health Insurance premium prediction in Python using scikit-learn
The main aim of this project is to predict the medical charges billed to each user by a health insurance company, using Python and scikit-learn.
Dataset link: https://www.kaggle.com/mirichoi0218/insurance
About the dataset:
age: age of the primary beneficiary
sex: gender of the beneficiary (female, male)
bmi: body mass index, an objective measure of body weight relative to height (kg / m ^ 2) that indicates whether weight is high or low for a given height; ideally 18.5 to 24.9
children: number of children covered by health insurance / number of dependents
smoker: whether the beneficiary smokes (yes, no)
region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.
charges: Individual medical costs billed by health insurance.
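The BMI definition above can be checked with a small calculation. This is only a minimal sketch: the function names `bmi` and `in_ideal_range` are illustrative and are not part of the dataset or the project code.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def in_ideal_range(value, low=18.5, high=24.9):
    """True if the BMI lies inside the 18.5 to 24.9 range quoted above."""
    return low <= value <= high


example = bmi(70, 1.75)  # a person weighing 70 kg who is 1.75 m tall
print(round(example, 2), in_ideal_range(example))
```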
Prerequisites :
- Python 3: https://www.python.org/downloads/
- Anaconda: https://www.anaconda.com/download/
Step -1 :
Import the necessary packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
Short description of the important libraries :
->Pandas is used to create and manipulate data frames.
->NumPy is used to create and manipulate arrays.
->Matplotlib is used for visualization purposes.
Data visualization and data preprocessing :
Step 1:
Import the dataset from the local folder.
data=pd.read_csv(r'D:\Codespeedy\Medical-Insurance-Prediction-master\insurance.csv')
data.head()
Step 2:
data.info()
Output :
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
Step 3:
Checking the Null values in the dataset.
data.isna().sum()
Output :
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
The dataset does not contain null values.
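Since the insurance data has no nulls, the check above prints only zeros. The sketch below shows what the same check looks like when values are missing, using a small made-up frame (`toy` is not the insurance data). Filling with the column median is one possible treatment, shown as an assumption rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing values (not the insurance data).
toy = pd.DataFrame({
    "age": [19, 33, np.nan, 45],
    "bmi": [27.9, np.nan, 33.0, 22.7],
    "smoker": ["yes", "no", "no", "yes"],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One possible treatment: fill numeric gaps with the column median.
filled = toy.fillna(toy.median(numeric_only=True))
print(filled)
```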
Step 4:
Plotting the classes that are present in the 'region' Column with respect to their frequency.
sns.catplot(x='region', kind="count", data= data);
Step 5:
Plotting the classes that are present in the 'sex' Column with respect to their frequency.
sns.catplot(x='sex', kind="count", data= data);
Step 6:
Plotting the classes that are present in the 'smoker' Column with respect to their frequency.
sns.catplot(x='smoker', kind="count", data= data);
Step 7:
Creating the numerical plot for the continuous variables that are present in the dataset.
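def num_plot(x, c='g'):
    # Plot the distribution of column x, then print its summary statistics.
    plt.figure(figsize=(16, 8))
    # histplot with kde=True is used because distplot is deprecated in recent seaborn releases.
    sns.histplot(data[x], color=c, kde=True)
    plt.show()
    print(10*'----', x, 10*'----')
    print('MIN: ', data[x].min())
    print('MAX: ', data[x].max())
    print('MEAN: ', data[x].mean())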
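The three statistics printed by num_plot can also be collected without plotting, which is handy when comparing several columns. This is a minimal sketch on a made-up frame; the helper name `summarize` is hypothetical.

```python
import pandas as pd


def summarize(df, column):
    """Return the minimum, maximum and mean of one numeric column as a dict."""
    s = df[column]
    return {"min": s.min(), "max": s.max(), "mean": s.mean()}


# Toy data, not the insurance dataset.
toy = pd.DataFrame({"age": [18, 25, 40, 64]})
stats = summarize(toy, "age")
print(stats)
```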
Step 8:
The most common ages are between 18 and 22.
num_plot('age')
The minimum age is 18.
The maximum age is 64.
The mean age is 39.20702541106.
Step 9 :
num_plot('bmi')
This variable is approximately normally distributed.
The minimum BMI is 15.96.
The maximum BMI is 53.13.
The mean BMI is 30.66339686.
Step 10 :
num_plot('children')
The most frequent value is 0, i.e. many users have no children covered by the insurance.
The minimum no of children is 0.
The maximum no of children is 5.
The mean no of children is 1.0949177.
Step 11:
num_plot('charges')
All charges are below 100000, and the distribution is right-skewed, with most values concentrated at the lower end.
The minimum charge is 1121.8739.
The maximum charge is 63770.42801.
The mean charge is 13270.422265.
Step 12 :
Plotting age against the row index, colored by charges.
plt.figure(figsize=(16,8))
sns.scatterplot(y=np.arange(len(data)), x=data['age'], hue=data['charges'], palette='viridis')
plt.show()
Step 13 :
AGE VS BMI
plt.figure(figsize=(16,8))
sns.scatterplot(y=data['bmi'], x=data['age'], hue=data['charges'], palette='viridis')
plt.show()
Step 14 :
Plotting the BMI distribution of users who are charged more than 50000.
plt.figure(figsize=(16,8))
data[data['charges']>50000]['bmi'].plot(kind='hist',color='m')
plt.show()
Step 15 :
Plotting the age distribution of users who are charged more than 50000.
plt.figure(figsize=(16,8))
data[data['charges']>50000]['age'].plot(kind='hist',color='m')
plt.show()
Step 16 :
Checking if there is any relationship between the mean insurance charges of smokers and non-smokers.
mean_smo = data['charges'][data['smoker'] == 'yes'].mean()
mean_non_smo = data['charges'][data['smoker'] == 'no'].mean()
mean_smo, mean_non_smo
Output :
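Step 17 :
Checking if there is any relationship between the mean insurance charges of males and females.
mean_male = data['charges'][data['sex'] == 'male'].mean()
mean_female = data['charges'][data['sex'] == 'female'].mean()
mean_male, mean_female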
Output :
(13956.751177721893, 12569.578843835347)
We can conclude from the above that the mean insurance charge for males is about 1,400 higher than for females (roughly 11%), a comparatively small difference.
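The pairwise mean comparisons above can also be written with groupby, which returns the mean for every category in one call. This is a minimal sketch on a made-up frame (`toy` and its values are invented for illustration, not taken from the insurance data).

```python
import pandas as pd

# Toy data, not the insurance dataset.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no"],
    "sex": ["male", "female", "male", "female", "female"],
    "charges": [30000.0, 5000.0, 7000.0, 34000.0, 6000.0],
})

# Mean charge for each category of the grouping column.
by_smoker = toy.groupby("smoker")["charges"].mean()
by_sex = toy.groupby("sex")["charges"].mean()
print(by_smoker)
print(by_sex)
```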
Step 18 :
Comparing the mean charges of users with a BMI above 25 against those with a BMI of 25 or less (normal range assumed to be 18.5 to 25).
mean_bmi_large = data['charges'][data['bmi'] > 25].mean()
mean_bmi_normal = data['charges'][data['bmi'] <= 25].mean()
mean_bmi_large, mean_bmi_normal
Output :
(13946.476035324473, 10284.290025182185)
We can conclude from the above that the mean insurance charge is higher for people with a BMI above the normal range, plausibly because they are more likely to suffer from diseases.
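The comparison above puts everyone with a BMI of 25 or less into one group, including users below 18.5. A finer split can be sketched with pd.cut, as below. The frame `toy`, the column name `bmi_band` and the band labels are hypothetical; the bin edges follow the 18.5 to 25 range assumed in this step.

```python
import numpy as np
import pandas as pd

# Toy data, not the insurance dataset.
toy = pd.DataFrame({
    "bmi": [17.0, 22.0, 24.9, 27.5, 31.0, 40.0],
    "charges": [4000.0, 5000.0, 6000.0, 9000.0, 12000.0, 20000.0],
})

# Right-closed bins: (0, 18.5], (18.5, 25], (25, inf)
toy["bmi_band"] = pd.cut(
    toy["bmi"],
    bins=[0, 18.5, 25, np.inf],
    labels=["below_normal", "normal", "above_normal"],
)

# Mean charge per BMI band.
band_means = toy.groupby("bmi_band", observed=True)["charges"].mean()
print(band_means)
```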
Step 19 :
Check whether the mean charges differ between users younger than 35 and users aged 35 or older.
mean_young = data['charges'][data['age'] < 35].mean()
mean_old = data['charges'][data['age'] >= 35].mean()
mean_young, mean_old
Output :
Step 20 :
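Plotting the correlation between the numeric variables. The text columns are excluded explicitly, because recent pandas versions raise an error when corr() meets non-numeric columns.
sns.heatmap(data.select_dtypes('number').corr(), annot = True, vmin=-1, vmax=1, center= 0)
Dropping the 'children' and 'region' columns.
data = data.drop(['children','region'], axis = 1)
data.head()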
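Encoding the categorical variables that are present in our dataset. pd.get_dummies returns one column per category, and current pandas does not allow a two-column result to be assigned to a single column, so a direct comparison is used to get one 0/1 column per variable. Here 'male' and 'yes' are coded as 1. The High_Risk counts and the negative smoker coefficient printed later in this article are consistent with smoker having been coded 1 for 'no', so numbers obtained with this encoding may differ from those outputs.
data['sex'] = (data['sex'] == 'male').astype(int)
data['smoker'] = (data['smoker'] == 'yes').astype(int)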
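When a two-category column is encoded, it helps to make explicit which category becomes 1. The sketch below shows two ways to get a single 0/1 column on a made-up frame. The column names `sex_code` and `smoker_code` are hypothetical, and coding 'male' and 'yes' as 1 is an assumption made for this illustration.

```python
import pandas as pd

# Toy data, not the insurance dataset.
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "smoker": ["yes", "no", "yes"],
})

# Explicit mapping: the meaning of 1 is visible in the code.
toy["sex_code"] = toy["sex"].map({"female": 0, "male": 1})
toy["smoker_code"] = toy["smoker"].map({"no": 0, "yes": 1})

# Same result with get_dummies: drop_first=True drops the first category
# (in sorted order), leaving one indicator column per variable.
dummies = pd.get_dummies(toy[["sex", "smoker"]], drop_first=True).astype(int)
print(dummies.columns.tolist())
print(dummies)
```

The generated column names (for example smoker_yes) record which category is coded as 1, which makes later filters such as smoker == 1 easier to read.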
Create a "High_Risk" column that is 1 when the individual is older than 35, is a smoker, and has a BMI greater than 25.
def coladd(age, smoker, bmi):
    # Return 1 only when all three risk conditions hold.
    if age > 35 and smoker == 1 and bmi > 25:
        return 1
    else:
        return 0
data['High_Risk'] = data[['age','smoker','bmi']].apply(lambda x: coladd(*x), axis=1)
data['High_Risk'].value_counts()
Output :
0    819
1    519
Name: High_Risk, dtype: int64
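The row-wise apply used for High_Risk can also be written with boolean masks, which avoids calling a Python function for every row. This is a minimal sketch on a made-up frame; it assumes smoker is coded 1 for smokers.

```python
import pandas as pd

# Toy data, not the insurance dataset.
toy = pd.DataFrame({
    "age": [20, 40, 50, 60],
    "smoker": [1, 1, 0, 1],  # assumed coding: 1 = smoker, 0 = non-smoker
    "bmi": [30.0, 28.0, 33.0, 22.0],
})

# Combine the three conditions element-wise, then convert True/False to 1/0.
high_risk = ((toy["age"] > 35) & (toy["smoker"] == 1) & (toy["bmi"] > 25)).astype(int)
toy["High_Risk"] = high_risk
print(toy["High_Risk"].value_counts())
```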
mean_high_risk = data['charges'][data['High_Risk'] == 1].mean()
mean_not_high_risk = data['charges'][data['High_Risk'] == 0].mean()
mean_high_risk, mean_not_high_risk
Output :
This supports our analysis: the high-risk individuals have almost 4 times the average insurance charge of the rest.
Create a "Medium_Risk" column that is 1 when the individual is a smoker and has a BMI greater than 25.
def coladd1(smoker, bmi):
    if smoker == 1 and bmi > 25:
        return 1
    else:
        return 0
data['Medium_Risk'] = data[['smoker','bmi']].apply(lambda x: coladd1(*x), axis=1)
Checking the same analysis for medium risk individuals
mean_med_risk = data['charges'][data['Medium_Risk'] == 1].mean()
mean_not_med_risk = data['charges'][data['Medium_Risk'] == 0].mean()
mean_med_risk, mean_not_med_risk
Output :
Our analysis holds again: the medium-risk individuals have almost 4 times the average insurance charge of the rest.
X = data[['age', 'sex', 'bmi', 'smoker','High_Risk', 'Medium_Risk']]
Y = data['charges']
X - Independent variables (features).
Y - Dependent variable (target).
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state = 100)
Model Building
Linear regression :
Training the Model by using the training dataset.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)
Predicting the results.
y_pred_linear = regressor.predict(X_test)
Calculating the mean squared error and the R² score (printed below as "Variance score").
print("Mean squared error: %.2f"
% mean_squared_error(y_test, y_pred_linear))
print('Variance score: %.2f' % r2_score(y_test, y_pred_linear))
Output :
Mean squared error: 30637726.75
Variance score: 0.79
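The two metrics can be reproduced by hand, which makes clear what they measure. The mean squared error is the average squared residual. The R² score is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. This is a minimal sketch with made-up numbers, not the insurance predictions.

```python
import numpy as np

# Made-up targets and predictions for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.5])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)                    # mean squared error
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination
print(mse, r2)
```

An R² of 0.79, as reported above, therefore means the model accounts for about 79% of the variance of the test-set charges.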
Visualizing the results using scatter plot.
plt.scatter(y_pred_linear, y_test, color='red')
Model Building using the lasso regression.
Train the lasso regression model using the training set.
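from sklearn.linear_model import LassoCV
lasso_eps = 0.0001
lasso_nalpha = 20
lasso_iter = 10000
# normalize=True is not accepted by scikit-learn 1.2 and later, so it is omitted here;
# standardize the features first (for example with StandardScaler) if scaling is wanted.
model_lasso = LassoCV(eps=lasso_eps, n_alphas=lasso_nalpha, max_iter=lasso_iter, cv=5)
model_lasso.fit(X_train, y_train)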
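To get the variable importance, print the fitted coefficients and the intercept. A coefficient of zero means the lasso has dropped that variable.
print(list(zip(model_lasso.coef_,X_train.columns)))
print(model_lasso.intercept_)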
Output :
[(263.9842580546879, 'age'), (-0.0, 'sex'), (425.4205810355094, 'bmi'), (-20037.03930661138, 'smoker'), (-0.0, 'High_Risk'), (-4617.541287287641, 'Medium_Risk')]
8903.607745718311
Predicting the test set results.
y_predited_lasso = model_lasso.predict(X_test)
Visualizing the results obtained from the lasso regression.
plt.scatter(y_predited_lasso,y_test)
plt.xlabel("Predicted Insurance")
plt.ylabel("Actual Insurance")
Both models fit the test dataset reasonably well (the linear model reaches an R² of 0.79), so either model can be taken forward for deployment.
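To choose between the two models, it helps to score both with the same metrics on the same test split. The sketch below does this on synthetic data that stands in for the insurance features. The data, the `models` and `scores` names and the LassoCV settings are assumptions for illustration, and the numbers it prints say nothing about the insurance dataset.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

models = {
    "linear": LinearRegression(),
    "lasso": LassoCV(cv=5, max_iter=10000),
}

# Fit each model and record (MSE, R²) on the shared test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = (mean_squared_error(y_test, pred), r2_score(y_test, pred))
    print(name, scores[name])
```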