The main aim of this project is to predict the insurance charges billed to each user by a health insurance company, using Python and scikit-learn.

Dataset link: https://www.kaggle.com/mirichoi0218/insurance

About the dataset:

age: age of the primary beneficiary

sex: insurance contractor gender, female, male

BMI: Body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m ^ 2). It indicates whether a person's weight is relatively high or low for their height; the ideal range is 18.5 to 24.9.

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes (yes / no)

region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance.
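The BMI formula in the column description can be made concrete with a short sketch. The helper names `bmi` and `bmi_category` and the example person are hypothetical; the 18.5 to 24.9 range is the one quoted in the description.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Classify a BMI value against the 18.5 to 24.9 'ideal' range."""
    if value < 18.5:
        return "below normal"
    if value < 25:
        return "normal"
    return "above normal"


# Hypothetical person: 70 kg, 1.75 m tall
example = bmi(70.0, 1.75)
print(round(example, 2), bmi_category(example))
```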

Prerequisites :

- Python 3: https://www.python.org/downloads/
- Anaconda: https://www.anaconda.com/download/

Step -1 :

Import the necessary packages

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Short Description about the important libraries :

->Pandas is used to create and manipulate the data frames.

->Numpy is used to create and manipulate the arrays

->Matplotlib is used for visualization purposes.

Data visualization and data preprocessing :

Step 1:

Import the dataset from the local folder.

data=pd.read_csv(r'D:\Codespeedy\Medical-Insurance-Prediction-master\insurance.csv')

data.head()

Step 2:

data.info()

Output :

Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

Step 3:

Checking the Null values in the dataset.

data.isna().sum()

Output :

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

The dataset does not contain any null values.
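For data that does contain gaps, the same check can drive a cleanup step. A minimal sketch on a made-up four-row sample with two missing values injected on purpose; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the insurance data with two gaps added deliberately
sample = pd.DataFrame({
    "age": [20, 30, np.nan, 40],
    "smoker": ["yes", "no", "no", None],
    "charges": [1000.0, 2000.0, 3000.0, 4000.0],
})

missing_per_column = sample.isna().sum()       # number of nulls in each column
has_missing = bool(missing_per_column.any())   # True if any column has a null
complete_rows = sample.dropna()                # keep only fully populated rows

print(missing_per_column)
print("any missing:", has_missing, "| complete rows:", len(complete_rows))
```

Dropping rows is only one option; whether to drop or fill missing values depends on how many rows are affected.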

Step 4:

Plotting the classes that are present in the 'region' Column with respect to their frequency.

sns.catplot(x='region', kind="count", data= data);

Step 5:

Plotting the classes that are present in the 'sex' Column with respect to their frequency.

sns.catplot(x='sex', kind="count", data= data);

Step 6:

Plotting the classes that are present in the 'smoker' Column with respect to their frequency.

sns.catplot(x='smoker', kind="count", data= data);
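The count plots show the class frequencies graphically; the same numbers can be read off directly with `value_counts`. A minimal sketch on a made-up six-row sample (the real columns use the same category labels):

```python
import pandas as pd

# Hypothetical sample, not rows from the real dataset
sample = pd.DataFrame({
    "region": ["southeast", "southwest", "southeast", "northwest", "northeast", "southeast"],
    "smoker": ["no", "yes", "no", "no", "no", "yes"],
})

for column in ["region", "smoker"]:
    counts = sample[column].value_counts()                # absolute frequencies
    shares = sample[column].value_counts(normalize=True)  # relative frequencies
    print(pd.DataFrame({"count": counts, "share": shares.round(2)}))
```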

Step 7:

Creating the numerical plot for the continuous variables that are present in the dataset.

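def num_plot(x, c='g'):
    plt.figure(figsize=(16, 8))
    # distplot is deprecated since seaborn 0.11; sns.histplot(data[x], kde=True, color=c) is its replacement
    sns.distplot(data[x], color=c)
    plt.show()
    print(10 * '----', x, 10 * '----')
    print('MIN: ', data[x].min())
    print('MAX: ', data[x].max())
    print('MEAN: ', data[x].mean())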
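The same three statistics can also be collected for several columns at once without plotting. A minimal sketch, assuming a hypothetical helper `num_summary` and a made-up three-row sample:

```python
import pandas as pd


def num_summary(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return min, max and mean for each requested numeric column (one row per column)."""
    return frame[columns].agg(["min", "max", "mean"]).T


# Hypothetical sample, not rows from the real dataset
sample = pd.DataFrame({"age": [18, 30, 64], "bmi": [20.0, 30.0, 40.0]})
summary = num_summary(sample, ["age", "bmi"])
print(summary)
```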

Step 8:

The largest group of users is between 18 and 22 years old.

num_plot('age')

The minimum age is 18.

The maximum age is 64

The mean age is 39.20702541106

Step 9 :

num_plot('bmi')

This variable is approximately normally distributed.

The minimum BMI is 15.96.

The maximum BMI is 53.13.

The mean BMI is 30.66339686.

Step 10 :

num_plot('children')

The largest group of users has no children.

The minimum number of children is 0.

The maximum number of children is 5.

The mean number of children is 1.0949177.

Step 11:

num_plot('charges')

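The charges are concentrated at the low end: every charge is below 100000, and the mean is far closer to the minimum than to the maximum.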

The minimum charge is 1121.8739.

The maximum charge is 63770.42801.

The mean charge is 13270.422265.

Step 12 :

AGE VS CHARGES

plt.figure(figsize=(16, 8))
# x is the age, y is only the row index; the charge is shown through the colour
sns.scatterplot(y=np.arange(len(data)), x=data['age'], hue=data['charges'], palette='viridis')
plt.show()

Step 13 :

AGE VS BMI

plt.figure(figsize=(16, 8))
sns.scatterplot(y=data['bmi'], x=data['age'], hue=data['charges'], palette='viridis')
plt.show()

Step 14 :

Plotting the BMI of users who pay more than 50000.

plt.figure(figsize=(16, 8))
data[data['charges'] > 50000]['bmi'].plot(kind='hist', color='m')
plt.show()

Step 15 :

Plotting the age of users who pay more than 50000.

plt.figure(figsize=(16, 8))
data[data['charges'] > 50000]['age'].plot(kind='hist', color='m')
plt.show()

Step 16:

Checking whether there is any difference between the mean insurance charge of smokers and non-smokers.

mean_smo = data['charges'][data['smoker'] == 'yes'].mean()
mean_non_smo = data['charges'][data['smoker'] == 'no'].mean()
mean_smo, mean_non_smo

Output :

(32050.23183153284, 8434.268297856204)

We can conclude from the above that the mean insurance charge for smokers is almost four times that for non-smokers, which is consistent with the higher health risks associated with smoking.

Step 17 :

Checking for gender as well.

mean_male = data['charges'][data['sex'] == 'male'].mean()
mean_female = data['charges'][data['sex'] == 'female'].mean()
mean_male, mean_female

Output :

(13956.751177721893, 12569.578843835347)

We can conclude from the above that the mean insurance charge for males is about 11% higher than for females, a much smaller gap than the one between smokers and non-smokers.

Step 18 :

Check for BMI higher than the normal range - Assumed normal range as 18.5 to 25.

mean_bmi_large = data['charges'][data['bmi'] > 25].mean()
mean_bmi_normal = data['charges'][data['bmi'] <= 25].mean()
mean_bmi_large, mean_bmi_normal

Output :

(13946.476035324473, 10284.290025182185)

We can conclude from the above that the mean insurance charge for people with a higher than normal BMI is about 36% higher, which is consistent with a greater likelihood of suffering from diseases.

Step 19 :

Check for age - comparing users younger than 35 with users aged 35 and above.

mean_young = data['charges'][data['age'] < 35].mean()
mean_old = data['charges'][data['age'] >= 35].mean()
mean_young, mean_old

Output :

(9673.316908395263, 15773.351087515843)
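The group comparisons in the last few steps all follow one pattern: split the rows by a condition and average the charges in each part. A minimal sketch of the same idea with `groupby` and `pd.cut`, on a made-up six-row sample; the band labels are illustrative names, not part of the dataset.

```python
import pandas as pd

# Hypothetical sample, not rows from the real dataset
sample = pd.DataFrame({
    "age": [20, 25, 40, 50, 60, 30],
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "charges": [30000.0, 2000.0, 7000.0, 40000.0, 12000.0, 4000.0],
})

# Mean charge per category of an existing column
mean_by_smoker = sample.groupby("smoker")["charges"].mean()

# Mean charge per age band; right=False gives the bands [0, 35) and [35, 200),
# which matches the age < 35 / age >= 35 split used above
age_band = pd.cut(sample["age"], bins=[0, 35, 200], right=False,
                  labels=["under 35", "35 and over"])
mean_by_age = sample.groupby(age_band, observed=True)["charges"].mean()

print(mean_by_smoker)
print(mean_by_age)
```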

Step 20 :

Finding the correlation between the variables by using a heat map.

# numeric_only=True (pandas 1.5+) skips the text columns sex, smoker and region
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, center=0)
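A correlation matrix is defined only for numeric columns, so a text column such as smoker has to be turned into numbers before it can take part. A minimal sketch on a made-up six-row sample; the indicator name `smoker_yes` is hypothetical.

```python
import pandas as pd

# Hypothetical sample, not rows from the real dataset
sample = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 25],
    "smoker": ["no", "no", "yes", "no", "yes", "no"],
    "charges": [2000.0, 4000.0, 30000.0, 9000.0, 45000.0, 3000.0],
})

# Keep the numeric columns and add a 0/1 indicator for the text column
numeric = sample.select_dtypes(include="number").copy()
numeric["smoker_yes"] = (sample["smoker"] == "yes").astype(int)

corr = numeric.corr()  # Pearson correlation between every pair of columns
print(corr.round(2))
```

The resulting `corr` frame can be passed to `sns.heatmap` in the same way as above.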

Feature Selection

The effect of the children and region columns on the output variable is very small, so remove those columns from the dataset.

data = data.drop(['children','region'], axis = 1)

data.head()

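Encoding the categorical variables that are present in our dataset.

cat_var = ['sex', 'smoker']
# get_dummies orders the categories alphabetically; keeping the first column
# means sex == 1 is female and smoker == 1 is non-smoker
data['sex'] = pd.get_dummies(data['sex']).iloc[:, 0].astype(int)
data['smoker'] = pd.get_dummies(data['smoker']).iloc[:, 0].astype(int)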
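Which label ends up as 1 matters for every later step that tests `smoker == 1`. One way to make the direction explicit is to map the labels by hand. This is a sketch of an alternative, not the encoding used in this post: here 1 is assumed to mean male and smoker, so results computed on it would not match the outputs reported below.

```python
import pandas as pd

# Hypothetical sample using the dataset's category labels
sample = pd.DataFrame({"sex": ["female", "male", "male"],
                       "smoker": ["yes", "no", "yes"]})

# Explicit dictionaries document which label becomes 1
encoded = sample.assign(
    sex=sample["sex"].map({"female": 0, "male": 1}),
    smoker=sample["smoker"].map({"no": 0, "yes": 1}),
)

# map() yields NaN for labels missing from the dictionary, so check for them
has_unknown_label = bool(encoded.isna().any().any())
print(encoded)
print("unknown labels:", has_unknown_label)
```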

Create a "High_Risk" column that is 1 when the individual is older than 35, is a smoker, and has a BMI greater than 25, and 0 otherwise.

def coladd(age, smoker, bmi):
    if age > 35 and smoker == 1 and bmi > 25:
        return 1
    else:
        return 0

data['High_Risk'] = data[['age', 'smoker', 'bmi']].apply(lambda x: coladd(*x), axis=1)

data['High_Risk'].value_counts()

Output :

0    819
1    519
Name: High_Risk, dtype: int64
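The row-wise `apply` can also be written with boolean masks, which avoids calling a Python function once per row. A minimal sketch on a made-up four-row sample with an already encoded 0/1 smoker column:

```python
import pandas as pd

# Hypothetical sample, not rows from the real dataset
sample = pd.DataFrame({
    "age": [40, 30, 50, 60],
    "smoker": [1, 1, 0, 1],
    "bmi": [30.0, 28.0, 31.0, 22.0],
})

# Combine the three conditions element-wise, then convert True/False to 1/0
high_risk = (sample["age"] > 35) & (sample["smoker"] == 1) & (sample["bmi"] > 25)
sample["High_Risk"] = high_risk.astype(int)

print(sample["High_Risk"].value_counts())
```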

Check whether the mean insurance amount of these individuals is higher than that of the rest.

mean_high_risk = data['charges'][data['High_Risk'] == 1].mean()
mean_not_high_risk = data['charges'][data['High_Risk'] == 0].mean()
mean_high_risk, mean_not_high_risk

Output :

(11226.504362427744, 14565.65229140293)

The output does not support the expectation: the flagged group has a lower mean charge (11226.50) than the rest (14565.65). This suggests that, in the encoding used above, smoker == 1 marks non-smokers, so High_Risk as computed selects non-smokers older than 35 with a BMI above 25 rather than smokers.

Let's build one more flag by ignoring the age and considering only BMI and smoker.

def coladd1(smoker, bmi):
    if smoker == 1 and bmi > 25:
        return 1
    else:
        return 0

data['Medium_Risk'] = data[['smoker', 'bmi']].apply(lambda x: coladd1(*x), axis=1)

Checking the same analysis for medium-risk individuals.

mean_med_risk = data['charges'][data['Medium_Risk'] == 1].mean()
mean_not_med_risk = data['charges'][data['Medium_Risk'] == 0].mean()
mean_med_risk, mean_not_med_risk

Output :

(8629.589609712157, 21954.55547444206)

Again the flagged group has the lower mean charge (8629.59 against 21954.56, about 2.5 times lower), which is consistent with smoker == 1 marking non-smokers in the encoded column.

Dividing the dataset into two parts, X and Y.

X = data[['age', 'sex', 'bmi', 'smoker', 'High_Risk', 'Medium_Risk']]
Y = data['charges']

X - independent variables (features).

Y - dependent variable (target).

Splitting the dataset for training and testing the model.

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state = 100)
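The effect of `test_size` and `random_state` is easiest to see on a tiny array. A minimal sketch with ten made-up samples: 30% go to the test set, and repeating the call with the same seed gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100)

# Same seed, same split
X_test_again = train_test_split(X, y, test_size=0.3, random_state=100)[1]

print(X_train.shape, X_test.shape)
print("reproducible:", np.array_equal(X_test, X_test_again))
```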

Linear regression :

Training the Model by using the training dataset.

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

Predicting the results.

y_pred_linear = regressor.predict(X_test)
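To see what the fitted model contains, it helps to train it on data with a known relationship. A minimal sketch with made-up numbers, where the target is exactly 3 times the first feature plus 10 times the second plus 5; the fit should recover those values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from a known linear rule
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
y = 3.0 * X[:, 0] + 10.0 * X[:, 1] + 5.0

reg = LinearRegression().fit(X, y)
prediction = reg.predict(np.array([[5.0, 1.0]]))  # unseen input

print(reg.coef_, reg.intercept_, prediction)
```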

Calculating the mean squared error and the R² score (printed as "Variance score").

print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred_linear))
print('Variance score: %.2f' % r2_score(y_test, y_pred_linear))

Output :

Mean squared error: 30637726.75
Variance score: 0.79
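Both metrics can be computed by hand, which shows what they measure. A minimal sketch on four made-up values, compared against the scikit-learn functions used above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

mse = np.mean((y_true - y_pred) ** 2)            # average squared error
rmse = np.sqrt(mse)                              # same unit as the target
ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
r2 = 1 - ss_res / ss_tot                         # share of variation explained

print(mse, rmse, r2)
print(mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))
```

Because the mean squared error is in squared units, its square root is easier to interpret: for the value reported above, the square root of 30637726.75 is about 5535, in the same unit as the charges.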

Visualizing the results using scatter plot.

plt.scatter( y_pred_linear, y_test, color=['red'])

Model Building using the lasso regression.

Train the lasso regression model using the training set.

from sklearn.linear_model import LassoCV

lasso_eps = 0.0001
lasso_nalpha = 20
lasso_iter = 10000
# normalize=True requires scikit-learn older than 1.2, where the parameter was removed
model_lasso = LassoCV(eps=lasso_eps, n_alphas=lasso_nalpha, max_iter=lasso_iter,
                      normalize=True, cv=5)
model_lasso.fit(X_train, y_train)

LassoCV(cv=5, eps=0.0001, max_iter=10000, n_alphas=20, normalize=True)
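The normalize argument of LassoCV is not available in scikit-learn 1.2 and later. One possible alternative is to standardize the features in a pipeline before the lasso step. This is a swapped-in technique, not the configuration used in this post: StandardScaler scales the features differently from normalize=True, so the selected alpha and the coefficients would not match the output reported here. The sketch runs on synthetic data in which the third feature is irrelevant by construction.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on the first two features only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Scale the features, then pick the lasso penalty by 5-fold cross-validation
model = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
model.fit(X, y)

lasso = model.named_steps["lassocv"]  # make_pipeline names steps after the class
print("alpha:", lasso.alpha_)
print("coefficients:", lasso.coef_)
```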

To get the variable importance, pair each lasso coefficient with its column name.

print(list(zip(model_lasso.coef_, X_train.columns)))
print(model_lasso.intercept_)

Output :

[(263.9842580546879, 'age'),
 (-0.0, 'sex'),
 (425.4205810355094, 'bmi'),
 (-20037.03930661138, 'smoker'),
 (-0.0, 'High_Risk'),
 (-4617.541287287641, 'Medium_Risk')]
8903.607745718311

Predicting the test set results.

y_predited_lasso = model_lasso.predict(X_test)

Visualizing the results obtained from the lasso regression.

plt.scatter(y_predited_lasso, y_test)
plt.xlabel("Predicted Insurance")
plt.ylabel("Actual Insurance")

Both models perform well on the test dataset, so either one can be finalized for deployment.

Submitted by Kotha Sai Narasimha Rao (kothasainarasimharao)
