The aim of this project is to predict the medical charges billed by a health insurance company for each user, using Python and scikit-learn.
Dataset link: https://www.kaggle.com/mirichoi0218/insurance
About the dataset:
age: age of the primary beneficiary.
sex: sex of the beneficiary (female or male).
bmi: body mass index, an objective measure of body weight relative to height (kg / m^2); the ideal range is 18.5 to 24.9.
children: number of children covered by health insurance / number of dependents.
smoker: whether the beneficiary smokes (yes or no).
region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest).
charges: individual medical costs billed by health insurance.
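As a quick illustration of the BMI definition above, the following sketch computes BMI from weight and height and checks it against the 18.5 to 24.9 range quoted in the dataset description. The function names `bmi` and `in_ideal_range` and the example person are hypothetical; they are not part of the dataset or of any library.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True if a BMI value lies inside the ideal range quoted in the description."""
    return low <= value <= high


# Hypothetical person: 70 kg, 1.75 m tall.
example = bmi(70.0, 1.75)
print(round(example, 2), in_ideal_range(example))
```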
Prerequisites :
Step -1 :
Import the necessary packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
Short description of the important libraries :
->Pandas is used to create and manipulate data frames.
->NumPy is used to create and manipulate arrays.
->Matplotlib is used for visualization.
Data visualization and data preprocessing :
Step 1:
Import the dataset from the local folder.
data=pd.read_csv(r'D:\Codespeedy\Medical-Insurance-Prediction-master\insurance.csv')
data.head()
Step 2:
data.info()
Output :
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
Step 3:
Checking the Null values in the dataset.
data.isna().sum()
Output :
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
The dataset does not contain any null values.
Step 4:
Plotting the classes that are present in the 'region' Column with respect to their frequency.
sns.catplot(x='region', kind="count", data= data);
Step 5:
Plotting the classes that are present in the 'sex' Column with respect to their frequency.
sns.catplot(x='sex', kind="count", data= data);
Step 6:
Plotting the classes that are present in the 'smoker' Column with respect to their frequency.
sns.catplot(x='smoker', kind="count", data= data);
Step 7:
Creating the numerical plot for the continuous variables that are present in the dataset.
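def num_plot(x, c='g'):
    # Plot the distribution of column x, then print its summary statistics.
    plt.figure(figsize=(16, 8))
    # histplot with kde=True is used in place of the deprecated sns.distplot.
    sns.histplot(data[x], color=c, kde=True)
    plt.show()
    print(10*'----', x, 10*'----')
    print('MIN: ', data[x].min())
    print('MAX: ', data[x].max())
    print('MEAN: ', data[x].mean())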
Step 8:
The largest group of users is between 18 and 22 years old.
num_plot('age')
The minimum age is 18.
The maximum age is 64
The mean age is 39.20702541106
Step 9 :
num_plot('bmi')
This variable is approximately normally distributed.
The minimum BMI is 15.96.
The maximum BMI is 53.13.
The mean BMI is 30.66339686.
Step 10 :
num_plot('children')
The most common number of children is 0. This says nothing about marital status, which is not recorded in the dataset.
The minimum number of children is 0.
The maximum number of children is 5.
The mean number of children is 1.0949177.
Step 11:
num_plot('charges')
All charges are below 100000, and the distribution is right-skewed: most values lie well below the maximum.
The minimum charge is 1121.8739.
The maximum charge is 63770.42801.
The mean charge is 13270.422265.
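The minimum, maximum and mean reported for each variable above can also be collected in a single table. The sketch below uses `DataFrame.agg` on a small made-up frame named `sample` (a hypothetical stand-in for the loaded `data`); the numbers in it are illustrative and are not taken from the insurance dataset.

```python
import pandas as pd

# Hypothetical miniature frame; in the article this would be the loaded `data`.
sample = pd.DataFrame({
    "age": [18, 40, 64],
    "bmi": [20.0, 30.0, 40.0],
    "charges": [1000.0, 5000.0, 9000.0],
})

# One row per statistic, one column per variable.
summary = sample[["age", "bmi", "charges"]].agg(["min", "max", "mean"])
print(summary)
```

On the real data, the same call on `data[['age', 'bmi', 'children', 'charges']]` should give the same figures that `num_plot` prints one variable at a time.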
Step 12 :
Plotting age against the row index, coloured by charges.
plt.figure(figsize=(16,8))
sns.scatterplot(y=np.arange(len(data)), x=data['age'], hue=data['charges'], palette='viridis')
plt.show()
Step 13 :
AGE VS BMI
plt.figure(figsize=(16,8))
sns.scatterplot(y=data['bmi'], x=data['age'], hue=data['charges'], palette='viridis')
plt.show()
Step 14 :
Plotting the BMI distribution of users whose charges exceed 50000.
plt.figure(figsize=(16,8))
data[data['charges']>50000]['bmi'].plot(kind='hist', color='m')
plt.show()
Step 15 :
Plotting the age distribution of users whose charges exceed 50000.
plt.figure(figsize=(16,8))
data[data['charges']>50000]['age'].plot(kind='hist', color='m')
plt.show()
Step 16:
Checking whether the mean insurance charge differs between smokers and non-smokers.
mean_smo = data['charges'][data['smoker'] == 'yes'].mean()
mean_non_smo = data['charges'][data['smoker'] == 'no'].mean()
mean_smo, mean_non_smo
Output :
Step 17 :
Checking whether the mean insurance charge differs between males and females.
mean_male = data['charges'][data['sex'] == 'male'].mean()
mean_female = data['charges'][data['sex'] == 'female'].mean()
mean_male, mean_female
Output :
(13956.751177721893, 12569.578843835347)
We can conclude from the above that the mean insurance charges for males and females are fairly close: the male mean is about 1,387 (roughly 11%) higher.
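The group means in the preceding steps are computed with one boolean filter per class. A more compact way to get the same kind of comparison is `groupby`, sketched below on a made-up frame named `toy` (hypothetical values, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data with the same column names as the insurance dataset.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "sex": ["male", "female", "male", "female"],
    "charges": [30000.0, 8000.0, 10000.0, 34000.0],
})

# Mean charge per class, one line per grouping column.
by_smoker = toy.groupby("smoker")["charges"].mean()
by_sex = toy.groupby("sex")["charges"].mean()

print(by_smoker)
print(by_sex)
```

The result is a Series indexed by class label, so `by_smoker['yes']` and `by_smoker['no']` play the role of `mean_smo` and `mean_non_smo`.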
Step 18 :
Check for BMI higher than the normal range - Assumed normal range as 18.5 to 25.
mean_bmi_large = data['charges'][data['bmi'] > 25].mean()
mean_bmi_normal = data['charges'][data['bmi'] <= 25].mean()
mean_bmi_large, mean_bmi_normal
Output :
(13946.476035324473, 10284.290025182185)
We can conclude from the above that the mean insurance charge is higher for people with a higher-than-normal BMI, possibly because they are more likely to suffer from health problems.
Step 19 :
Check whether age affects charges by comparing users younger than 35 with users aged 35 or older.
mean_young = data['charges'][data['age'] < 35].mean()
mean_old = data['charges'][data['age'] >= 35].mean()
mean_young, mean_old
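Output :
Plotting the correlation heatmap of the numeric columns.
# corr() is restricted to numeric columns, because the text columns (sex, smoker, region) cannot be correlated directly.
sns.heatmap(data.select_dtypes('number').corr(), annot=True, vmin=-1, vmax=1, center=0)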
data = data.drop(['children','region'], axis = 1)
data.head()
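Encoding the categorical variables that are present in our dataset.
cat_var = ['sex','smoker']
# pd.get_dummies returns one column per class, so keep a single indicator column for each variable.
data['sex'] = pd.get_dummies(data['sex'])['male'].astype(int)      # 1 = male, 0 = female
data['smoker'] = pd.get_dummies(data['smoker'])['yes'].astype(int) # 1 = smoker, 0 = non-smoker
Note: the outputs printed in the following steps come from the original run. If that run kept the first dummy column instead (female, no), its counts and coefficient signs will differ from what this encoding gives.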
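An alternative way to encode both columns in one call is `pd.get_dummies` with `columns=` and `drop_first=True`, which keeps one indicator per variable and names it after the class it represents. The sketch below runs on a made-up frame named `toy` (hypothetical rows); it is one possible encoding, not necessarily the one used for the outputs in this article.

```python
import pandas as pd

# Hypothetical toy rows with the two categorical columns.
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["no", "yes", "no"],
})

# drop_first=True drops the alphabetically first class of each column
# ("female" and "no"), leaving the indicators sex_male and smoker_yes.
encoded = pd.get_dummies(toy, columns=["sex", "smoker"], drop_first=True, dtype=int)
print(encoded)
```

Because the column names carry the class (`smoker_yes`), it is harder to confuse which value 1 stands for when the flag is used later.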
Create a "High_Risk" column that flags individuals who are older than 35, are smokers, and have a BMI greater than 25.
def coladd(age, smoker, bmi):
    # Return 1 when all three risk conditions hold, otherwise 0.
    if age > 35 and smoker == 1 and bmi > 25:
        return 1
    else:
        return 0
data['High_Risk'] = data[['age','smoker','bmi']].apply(lambda x: coladd(*x), axis=1)
data['High_Risk'].value_counts()
Output :
0    819
1    519
Name: High_Risk, dtype: int64
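The row-wise `apply` used to build the risk flag can also be written with boolean masks, which avoids calling a Python function once per row. The sketch below is a minimal version on a made-up frame named `toy` (hypothetical rows), assuming the smoker column holds 1 for smokers; it also checks that both approaches agree.

```python
import pandas as pd

# Hypothetical toy rows; smoker is assumed to be 1 for smokers, 0 otherwise.
toy = pd.DataFrame({
    "age": [40, 40, 30, 50],
    "smoker": [1, 0, 1, 1],
    "bmi": [30.0, 30.0, 30.0, 22.0],
})

# Vectorized flag: all three conditions combined with element-wise AND.
high_risk = ((toy["age"] > 35) & (toy["smoker"] == 1) & (toy["bmi"] > 25)).astype(int)

# Row-wise version, equivalent to the apply-based approach.
row_wise = toy.apply(
    lambda r: 1 if r["age"] > 35 and r["smoker"] == 1 and r["bmi"] > 25 else 0,
    axis=1,
)

print(high_risk.tolist(), row_wise.tolist())
```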
mean_high_risk = data['charges'][data['High_Risk'] == 1].mean()
mean_not_high_risk = data['charges'][data['High_Risk'] == 0].mean()
mean_high_risk, mean_not_high_risk
Output :
Create a "Medium_Risk" column that flags individuals who are smokers and have a BMI greater than 25, regardless of age.
def coladd1(smoker, bmi):
    # Return 1 when both risk conditions hold, otherwise 0.
    if smoker == 1 and bmi > 25:
        return 1
    else:
        return 0
data['Medium_Risk'] = data[['smoker','bmi']].apply(lambda x: coladd1(*x), axis=1)
Running the same analysis for medium-risk individuals.
mean_med_risk = data['charges'][data['Medium_Risk'] == 1].mean()
mean_not_med_risk = data['charges'][data['Medium_Risk'] == 0].mean()
mean_med_risk, mean_not_med_risk
Output :
X = data[['age', 'sex', 'bmi', 'smoker', 'High_Risk', 'Medium_Risk']]
Y = data['charges']
X - independent variables (features).
Y - dependent variable (target).
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state = 100)
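To see what `test_size=0.3` means for this dataset, the arithmetic below works out the expected size of each split for the 1338 rows reported by `data.info()`. It is a sketch that assumes the test share is rounded up to a whole number of rows and the remainder goes to training; the variable names are illustrative.

```python
import math

n_samples = 1338   # number of rows reported by data.info()
test_size = 0.3    # fraction passed to train_test_split

# Assumption: the test share is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

On the real data, `X_train.shape` and `X_test.shape` can be printed after the split to confirm the row counts.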
Linear regression :
Training the Model by using the training dataset.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Predicting the results.
y_pred_linear = regressor.predict(X_test)
Calculating the mean squared error and the R^2 score (coefficient of determination), which is printed below as the variance score.
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred_linear))
print('Variance score: %.2f' % r2_score(y_test, y_pred_linear))
Output :
Mean squared error: 30637726.75
Variance score: 0.79
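The two metrics are easier to interpret once their definitions are spelled out. The sketch below implements the standard formulas for mean squared error and R^2 in plain Python on made-up numbers, and takes the square root of the reported MSE to express the error in the same units as the charges. The helper names `mse` and `r2` and the toy values are illustrative.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values, only to exercise the formulas.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(mse(y_true, y_pred), r2(y_true, y_pred))

# Root mean squared error for the MSE reported in the output above.
rmse_reported = math.sqrt(30637726.75)
print(round(rmse_reported, 2))
```

The root of the reported MSE is roughly 5,535, so a typical prediction error is of that order in charge units, compared with a mean charge of about 13,270.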
Visualizing the results using scatter plot.
plt.scatter( y_pred_linear, y_test, color=['red'])
Model Building using the lasso regression.
Train the lasso regression model using the training set.
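from sklearn.linear_model import LassoCV
lasso_eps = 0.0001
lasso_nalpha = 20
lasso_iter = 10000
# The original call also passed normalize=True, which was removed in scikit-learn 1.2;
# scale the features beforehand (for example with StandardScaler) if scaling is wanted.
model_lasso = LassoCV(eps=lasso_eps, n_alphas=lasso_nalpha, max_iter=lasso_iter, cv=5)
model_lasso.fit(X_train, y_train)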
print(list(zip(model_lasso.coef_, X_train.columns)))
print(model_lasso.intercept_)
Output :
[(263.9842580546879, 'age'), (-0.0, 'sex'), (425.4205810355094, 'bmi'), (-20037.03930661138, 'smoker'), (-0.0, 'High_Risk'), (-4617.541287287641, 'Medium_Risk')]
8903.607745718311
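Lasso regression can shrink coefficients exactly to zero, which effectively removes those features from the model. The sketch below takes the coefficient list printed above and separates the features the model kept from the ones it dropped; the variable names `kept` and `dropped` are illustrative.

```python
# Coefficient/feature pairs copied from the printed output above.
coefficients = [
    (263.9842580546879, "age"),
    (-0.0, "sex"),
    (425.4205810355094, "bmi"),
    (-20037.03930661138, "smoker"),
    (-0.0, "High_Risk"),
    (-4617.541287287641, "Medium_Risk"),
]

# A coefficient of exactly zero means the feature does not affect the prediction.
kept = [name for coef, name in coefficients if coef != 0.0]
dropped = [name for coef, name in coefficients if coef == 0.0]

print("kept:", kept)
print("dropped:", dropped)
```

In this run the lasso model relies on age, bmi, smoker and Medium_Risk, while sex and High_Risk contribute nothing to its predictions.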
Predicting the test set results.
y_predicted_lasso = model_lasso.predict(X_test)
Visualizing the results obtained from the lasso regression.
plt.scatter(y_predicted_lasso, y_test)
plt.xlabel("Predicted Insurance")
plt.ylabel("Actual Insurance")
Both models perform well on the test dataset (linear regression reaches an R^2 score of 0.79), so either one can be chosen for deployment.
Submitted by Kotha Sai Narasimha Rao (kothasainarasimharao)