The main aim of this project is to predict the insurance charges billed to each user by a health insurance company, using Python and scikit-learn.

Dataset link: https://www.kaggle.com/mirichoi0218/insurance

About the dataset:

age: age of the primary beneficiary

sex: insurance contractor gender, female, male

bmi: body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m ^ 2). It indicates whether a weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes, yes, no

region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance.
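The BMI definition above can be made concrete with a small sketch. The function name bmi and the example person (70 kg, 1.75 m) are assumptions chosen purely for illustration; the dataset already contains the computed value in its bmi column.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
example_bmi = bmi(70, 1.75)
print(round(example_bmi, 2))  # 22.86

# Check against the ideal range quoted in the column description.
in_ideal_range = 18.5 <= example_bmi <= 24.9
print(in_ideal_range)  # True
```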

Prerequisites :

- Python 3: https://www.python.org/downloads/
- Anaconda: https://www.anaconda.com/download/

Step 1 :

Import the necessary packages

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Short Description about the important libraries :

->Pandas is used to create and manipulate the data frames.

->Numpy is used to create and manipulate the arrays

->Matplotlib is used for visualization purposes.

Data visualization and data preprocessing :

Step 1:

Import the dataset from the local folder.

data=pd.read_csv(r'D:\Codespeedy\Medical-Insurance-Prediction-master\insurance.csv')

data.head()

Step 2:

data.info()

Output :

Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

Step 3:

Checking the Null values in the dataset.

data.isna().sum()

Output :

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

The dataset does not contain any null values.

Step 4:

Plotting the classes that are present in the 'region' Column with respect to their frequency.

sns.catplot(x='region', kind="count", data= data);

Step 5:

Plotting the classes that are present in the 'sex' Column with respect to their frequency.

sns.catplot(x='sex', kind="count", data= data);

Step 6:

Plotting the classes that are present in the 'smoker' Column with respect to their frequency.

sns.catplot(x='smoker', kind="count", data= data);

Step 7:

Creating the numerical plot for the continuous variables that are present in the dataset.

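def num_plot(x, c='g'):
    # Plot the distribution of column x, then print its summary statistics.
    plt.figure(figsize=(16, 8))
    # sns.histplot with kde=True replaces the deprecated sns.distplot.
    sns.histplot(data[x], color=c, kde=True)
    plt.show()
    print(10*'----', x, 10*'----')
    print('MIN: ', data[x].min())
    print('MAX: ', data[x].max())
    print('MEAN: ', data[x].mean())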
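The three statistics printed by the helper can also be collected for several columns at once, without plotting. The following is a minimal sketch on a hypothetical three-row frame named toy (not the real insurance data), using pandas' agg method.

```python
import pandas as pd

# Hypothetical stand-in for the insurance data (three rows only).
toy = pd.DataFrame({"age": [18, 40, 64], "bmi": [20.0, 30.0, 40.0]})

# One row per statistic, one column per variable.
summary = toy[["age", "bmi"]].agg(["min", "max", "mean"])
print(summary)
```

On the real dataset, the same call on data[['age', 'bmi', 'children', 'charges']] should give the minimum, maximum and mean values reported in the following steps.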

Step 8:

The largest concentration of users is between the ages of 18 and 22.

num_plot('age')

The minimum age is 18.

The maximum age is 64

The mean age is 39.20702541106

Step 9 :

num_plot('bmi')

This variable is approximately normally distributed.

The minimum BMI is 15.96.

The maximum BMI is 53.13.

The mean BMI is 30.66339686.

Step 10 :

num_plot('children')

The most common number of children is 0.

The minimum number of children is 0.

The maximum number of children is 5.

The mean number of children is 1.0949177.

Step 11:

num_plot('charges')

All of the charges are well below 100000, and most of them lie at the lower end of the range.

The minimum charge is 1121.8739.

The maximum charge is 63770.42801.

The mean charge is 13270.422265.

Step 12 :

AGE VS CHARGES

plt.figure(figsize=(16,8))
# x is the age, y is the row index, and the colour encodes the charges.
sns.scatterplot(y=np.arange(len(data)), x=data['age'], hue=data['charges'], palette='viridis')
plt.show()

Step 13 :

AGE VS BMI

plt.figure(figsize=(16,8))
# x is the age, y is the BMI, and the colour encodes the charges.
sns.scatterplot(y=data['bmi'], x=data['age'], hue=data['charges'], palette='viridis')
plt.show()

Step 14 :

Plotting the BMI distribution of the users whose charges exceed 50000.

plt.figure(figsize=(16,8))
data[data['charges']>50000]['bmi'].plot(kind='hist', color='m')
plt.show()

Step 15 :

Plotting the age distribution of the users whose charges exceed 50000.

plt.figure(figsize=(16,8))
data[data['charges']>50000]['age'].plot(kind='hist', color='m')
plt.show()

Step 16 :

Checking whether the mean insurance charge differs between smokers and non-smokers.

mean_smo = data['charges'][data['smoker'] == 'yes'].mean()
mean_non_smo = data['charges'][data['smoker'] == 'no'].mean()
mean_smo, mean_non_smo

Output :

(32050.23183153284, 8434.268297856204)

We can conclude from the above that the mean insurance charge for smokers is almost four times that of non-smokers (about 32050 versus 8434), which is consistent with the higher health risks associated with smoking.
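The same kind of comparison can be written with groupby, which returns the mean for every category in one call. This is a minimal sketch on a hypothetical four-row frame named toy; its numbers are made up and are not the dataset's.

```python
import pandas as pd

# Hypothetical toy data, not the real insurance records.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "yes", "no"],
    "charges": [30000.0, 8000.0, 34000.0, 9000.0],
})

# Mean charge per smoker category.
mean_by_smoker = toy.groupby("smoker")["charges"].mean()
print(mean_by_smoker)

# How many times larger the smokers' mean is.
ratio = mean_by_smoker["yes"] / mean_by_smoker["no"]
print(round(ratio, 2))  # 3.76
```

Replacing "smoker" with "sex" or "region" gives the corresponding comparison for those columns.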

Step 17 :

Checking for gender as well.

mean_male = data['charges'][data['sex'] == 'male'].mean()
mean_female = data['charges'][data['sex'] == 'female'].mean()
mean_male, mean_female

Output :

(13956.751177721893, 12569.578843835347)

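We can conclude from the above that the mean insurance charge for males is only about 11% higher than for females (about 13957 versus 12570), a far smaller gap than the one between smokers and non-smokers.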

Step 18 :

Check for BMI higher than the normal range - Assumed normal range as 18.5 to 25.

mean_bmi_large = data['charges'][data['bmi'] > 25].mean()
mean_bmi_normal = data['charges'][data['bmi'] <= 25].mean()
mean_bmi_large, mean_bmi_normal

Output :

(13946.476035324473, 10284.290025182185)

We can conclude from the above that the mean insurance charge for people with a higher than normal BMI is larger (about 13946 versus 10284), plausibly because they are more likely to suffer from diseases.

Step 19 :

Check for age as well, comparing users younger than 35 with users aged 35 or older.

mean_young = data['charges'][data['age'] < 35].mean()
mean_old = data['charges'][data['age'] >= 35].mean()
mean_young, mean_old

Output :

(9673.316908395263, 15773.351087515843)
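Threshold comparisons like the BMI and age checks can also be expressed by binning the column with pd.cut and grouping on the bins. This is a minimal sketch on hypothetical toy data; the bin edges 17, 34 and 64 are assumptions chosen to mirror the "younger than 35" split for integer ages between 18 and 64.

```python
import pandas as pd

# Hypothetical toy data, not the real insurance records.
toy = pd.DataFrame({
    "age": [19, 28, 34, 35, 50, 64],
    "charges": [2000.0, 4000.0, 6000.0, 10000.0, 14000.0, 18000.0],
})

# Intervals are (17, 34] and (34, 64], i.e. ages 18-34 and 35-64.
age_group = pd.cut(toy["age"], bins=[17, 34, 64],
                   labels=["under 35", "35 and over"])

# Mean charge in each age group.
mean_by_age = toy.groupby(age_group, observed=True)["charges"].mean()
print(mean_by_age)
```

Adding more bin edges would split the users into finer age bands without writing one boolean mask per band.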

Step 20 :

Finding the correlation between the variables by using a heat map.

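# numeric_only=True restricts the correlation to the numeric columns; recent pandas versions raise an error on the text columns otherwise.
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, center=0)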
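Text columns such as smoker do not appear in a correlation matrix unless they are first converted to numbers. The sketch below, on a hypothetical four-row frame, shows one way to include a yes/no column by encoding it as 1/0 before calling corr; the encoding (1 for 'yes') is an assumption made for this illustration.

```python
import pandas as pd

# Hypothetical toy data, not the real insurance records.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "smoker": ["no", "yes", "no", "yes"],
    "charges": [2000.0, 20000.0, 6000.0, 24000.0],
})

# Encode the yes/no column as 1/0 so that it can take part in the correlation.
numeric = toy.assign(smoker=(toy["smoker"] == "yes").astype(int))

corr = numeric.corr()
print(corr.round(2))
```

The resulting frame can be passed to sns.heatmap in the same way as above.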

Feature Selection

The children and region columns have very little effect on the output variable, so we remove those columns from the dataset.

data = data.drop(['children','region'], axis = 1)

data.head()

Encoding the categorical variables that are present in our dataset.

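cat_var = ['sex', 'smoker']
# get_dummies returns one column per category, so a single column has to be selected.
# Assumption: the first column is the one kept, so sex == 1 means 'female' and smoker == 1 means 'no';
# this is the encoding consistent with the outputs shown below.
data['sex'] = pd.get_dummies(data['sex'])['female'].astype(int)
data['smoker'] = pd.get_dummies(data['smoker'])['no'].astype(int)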
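If the intent is for 1 to mean "smoker" and "male", an explicit mapping leaves no doubt about which label each code stands for. This is a minimal sketch on a hypothetical toy frame; the two mapping dictionaries are assumptions and are not the encoding that produced the outputs in this post, so the downstream numbers would change if it were used.

```python
import pandas as pd

# Hypothetical toy data, not the real insurance records.
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "yes"],
})

encoded = toy.copy()
# Explicit label -> code mappings (assumed for this illustration).
encoded["sex"] = encoded["sex"].map({"female": 0, "male": 1})
encoded["smoker"] = encoded["smoker"].map({"no": 0, "yes": 1})
print(encoded)
```

Any label missing from a dictionary becomes NaN after map, which makes unexpected categories easy to spot.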

Create a 'High_Risk' column that flags individuals who are older than 35, are smokers, and have a BMI greater than 25.

def coladd(age, smoker, bmi):
    # Return 1 when all three conditions hold, otherwise 0.
    if age > 35 and smoker == 1 and bmi > 25:
        return 1
    else:
        return 0

data['High_Risk'] = data[['age','smoker','bmi']].apply(lambda x: coladd(*x), axis=1)

data['High_Risk'].value_counts()

Output :

0    819
1    519
Name: High_Risk, dtype: int64

Check whether the mean insurance charge of these individuals is higher than that of the rest.

mean_high_risk = data['charges'][data['High_Risk'] == 1].mean()
mean_not_high_risk = data['charges'][data['High_Risk'] == 0].mean()
mean_high_risk, mean_not_high_risk

Output :

(11226.504362427744, 14565.65229140293)

The output does not support the hypothesis: the flagged group averages about 11227, which is lower than the 14566 of the rest. Its mean is also far below the smokers' mean of about 32050 found earlier, so smoker == 1 evidently marks the non-smokers here, which is what happens when the 'no' dummy column ends up in the smoker column. To flag actual smokers, the encoding has to map 'yes' to 1.

Let's build one more feature that ignores the age and considers only the BMI and the smoker column.

def coladd1(smoker, bmi):
    # Return 1 when both conditions hold, otherwise 0.
    if smoker == 1 and bmi > 25:
        return 1
    else:
        return 0

data['Medium_Risk'] = data[['smoker','bmi']].apply(lambda x: coladd1(*x), axis=1)
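Row-wise apply works, but the same flags can be computed with boolean masks over whole columns, which is usually shorter and faster. This is a minimal sketch on a hypothetical toy frame; it assumes that smoker == 1 means "smoker", which should be checked against the encoding actually in use.

```python
import pandas as pd

# Hypothetical toy data; smoker is assumed to be 1 for smokers and 0 otherwise.
toy = pd.DataFrame({
    "age": [25, 40, 60, 50],
    "smoker": [1, 1, 0, 1],
    "bmi": [30.0, 31.0, 35.0, 22.0],
})

# Element-wise conditions combined with & instead of a Python-level loop.
smokes_and_high_bmi = (toy["smoker"] == 1) & (toy["bmi"] > 25)

toy["Medium_Risk"] = smokes_and_high_bmi.astype(int)
toy["High_Risk"] = (smokes_and_high_bmi & (toy["age"] > 35)).astype(int)
print(toy)
```

By construction every high-risk row is also a medium-risk row, so the two columns are strongly related.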

Checking the same analysis for medium risk individuals

mean_med_risk = data['charges'][data['Medium_Risk'] == 1].mean()
mean_not_med_risk = data['charges'][data['Medium_Risk'] == 0].mean()
mean_med_risk, mean_not_med_risk

Output :

(8629.589609712157, 21954.55547444206)

Again the output runs opposite to the hypothesis: the flagged group averages about 8630 against 21955 for the rest. Because smoker == 1 marks the non-smokers in this encoding, the flag actually selects non-smokers with a BMI above 25.

Dividing the dataset into two parts, X and Y.

X = data[['age', 'sex', 'bmi', 'smoker', 'High_Risk', 'Medium_Risk']]
Y = data['charges']

X - independent variables (the features).

Y - dependent variable (the target).

Splitting the dataset for training and testing the model.

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state = 100)
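With test_size=0.3, roughly 30% of the rows are held out for testing, and a fixed random_state makes the split repeatable. The sketch below shows this on a hypothetical array of ten rows; the names X_toy and y_toy are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten rows with two features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100
)
print(len(X_tr), len(X_te))  # 7 3

# The same random_state gives the same split again.
_, _, _, y_te_again = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100
)
print(np.array_equal(y_te, y_te_again))  # True
```

On the 1338-row insurance dataset the same setting holds out about 400 rows for testing.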

Linear regression :

Training the Model by using the training dataset.

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

Predicting the results.

y_pred_linear = regressor.predict(X_test)

Calculating the mean squared error and the R-squared score (printed below as the variance score).

print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred_linear))
print('Variance score: %.2f' % r2_score(y_test, y_pred_linear))

Output :

Mean squared error: 30637726.75
Variance score: 0.79
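To see what these two numbers measure, both metrics can be computed by hand. The sketch below uses four hypothetical predictions (not the model's) and the helper names mse and r2, which are introduced here for illustration only. It also converts the reported mean squared error into a root mean squared error, which is in the same units as the charges and comes out at roughly 5535.

```python
import math

import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """1 minus the ratio of the residual sum of squares to the total sum of squares."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


# Hypothetical actual and predicted charges.
y_actual = [1000.0, 2000.0, 3000.0, 4000.0]
y_hat = [1100.0, 1900.0, 3200.0, 3800.0]

toy_mse = mse(y_actual, y_hat)
toy_r2 = r2(y_actual, y_hat)
print(toy_mse, round(toy_r2, 2))  # 25000.0 0.98

# Root mean squared error implied by the MSE reported above.
rmse_reported = math.sqrt(30637726.75)
print(round(rmse_reported))  # 5535
```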

Visualizing the results using scatter plot.

plt.scatter( y_pred_linear, y_test, color=['red'])

Model Building using the lasso regression.

Train the lasso regression model using the training set.

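from sklearn.linear_model import LassoCV

lasso_eps = 0.0001
lasso_nalpha = 20
lasso_iter = 10000

# Note: the normalize argument was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call needs an older release; on newer ones, drop it and scale the features beforehand.
model_lasso = LassoCV(eps=lasso_eps, n_alphas=lasso_nalpha, max_iter=lasso_iter,
                      normalize=True, cv=5)
model_lasso.fit(X_train, y_train)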

LassoCV(cv=5, eps=0.0001, max_iter=10000, n_alphas=20, normalize=True)
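On scikit-learn versions where the normalize argument is no longer available, one common substitute is to scale the features in a pipeline before the lasso step. The following is a sketch under that assumption, run on synthetic data rather than the insurance dataset; StandardScaler is not numerically identical to the old normalize=True behaviour, so it should not be expected to reproduce the coefficients shown in this post exactly.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on the first two features only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = 5.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=120)

# Scale the features, then let LassoCV choose alpha by 5-fold cross-validation.
pipeline = make_pipeline(
    StandardScaler(),
    LassoCV(eps=0.0001, max_iter=10000, cv=5),
)
pipeline.fit(X_toy, y_toy)

lasso = pipeline.named_steps["lassocv"]
print(lasso.alpha_)   # chosen regularisation strength
print(lasso.coef_)    # coefficients on the scaled features

score = pipeline.score(X_toy, y_toy)  # R-squared on the data used for fitting
print(round(score, 3))
```

Because the coefficients refer to the scaled features, their magnitudes are comparable with each other but are not in the original units of the columns.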

To get the variable importance, print the coefficient of each feature together with the intercept.

print(list(zip(model_lasso.coef_, X_train.columns)))
print(model_lasso.intercept_)

Output :

[(263.9842580546879, 'age'), (-0.0, 'sex'), (425.4205810355094, 'bmi'), (-20037.03930661138, 'smoker'), (-0.0, 'High_Risk'), (-4617.541287287641, 'Medium_Risk')]
8903.607745718311

Predicting the test set results.

y_predited_lasso = model_lasso.predict(X_test)

Visualizing the results obtained from the lasso regression.

plt.scatter(y_predited_lasso, y_test)
plt.xlabel("Predicted Insurance")
plt.ylabel("Actual Insurance")

Both models perform reasonably well on the test dataset (the linear regression reaches an R-squared score of 0.79), so either model can be finalized for deployment.

Submitted by Kotha Sai Narasimha Rao (kothasainarasimharao)
