In this project, we will create a logistic regression model to predict whether or not a patient’s heart failure is fatal.
Logistic Regression is one of the most fundamental algorithms for binary classification. It is very easy to implement in Python and works well when the classes are roughly linearly separable. It is also quick and not too computationally expensive, making it popular amongst not only students but also researchers and data scientists.
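Under the hood, logistic regression passes a weighted sum of the input features through the sigmoid function, which squashes any real number into the range (0, 1) so it can be read as a probability. Here is a minimal sketch of that idea; the weights, bias, and feature values below are made up purely for illustration:

import numpy as np

def sigmoid(z):
    # Squash a real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

# Made-up weights, bias, and a single feature vector
w = np.array([0.4, -0.2, 0.1])
b = 0.5
x = np.array([1.0, 2.0, 3.0])

p = sigmoid(np.dot(w, x) + b)  # Predicted probability of class 1
print(p)  # Classify as 1 if p >= 0.5, else 0

Training the model simply means finding the weights and bias that make these probabilities match the observed labels as closely as possible, which Scikit-learn handles for us.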
Thanks to the Scikit-learn library, creating your own logistic regression model in Python takes only a few lines of code. So without any further delay, let's get right into it!
Dataset
The dataset we are going to work with is available at this Google Drive link. We will look into the contents a little later.
Installing Dependencies
Before we start working with our data, let us look at the dependencies for this project. We are going to work with Scikit-learn and will obviously require both NumPy and Pandas to handle the data. Additionally, we will make use of Seaborn, a data visualization library, along with Matplotlib, which it builds on. These will help us determine the crucial factors that affect our prediction simply by looking at some beautifully generated graphs and heatmaps.
To install these, go to your command window/terminal and type the following commands:
pip install scikit-learn
pip install numpy
pip install pandas
pip install seaborn
pip install matplotlib
There! We now have all the dependencies required to execute our code. Let us now start the coding by first importing all the libraries we need.
Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
Next, we need to take a look at the contents of our dataset.
Analyzing the Data
Luckily, this dataset has no missing values or stray entries in it; the data preprocessing has already been taken care of. We can thus move on to reading the data and looking at its attributes. Use the following code snippet to read the data:
data = pd.read_csv('path/to/data')
data.head(5)  # View first 5 rows
You can view the data from the folder attached to this packet.
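If you want to verify for yourself that nothing is missing, a quick sanity check like the one below does the job; this is just a convenience, not a required step:

data.shape           # (rows, columns) of the dataset
data.isnull().sum()  # Count of missing values per column; should be all zeros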
Data Visualization
Next, let us take a look at how the various attributes relate to one another. This is essentially a correlation matrix, and we are going to use the Seaborn library to visualize it as a heatmap. Use the following code snippet:
# Correlation matrix
# Seeing how each of the parameters is related
corrmat = data.corr()
f, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(corrmat, vmax=.8, square=True)
The heatmap looks something like this:
From what we see on the heatmap, the attributes that affect the death of a patient most drastically are (a quick numerical check of this follows the list):
1) Age
2) Serum Creatinine
3) Ejection Fraction
4) Time
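You can back up this visual reading with numbers by sorting each attribute's correlation with the target by absolute value. This is an optional check, not part of the original packet:

# Rank features by the strength of their correlation with the target
# (DEATH_EVENT itself will appear first with a correlation of 1.0)
corrmat['DEATH_EVENT'].abs().sort_values(ascending=False)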
Based on our observation, let us now make a regression model where we feed these 4 factors as our input and predict the DEATH_EVENT attribute, i.e. whether the patient survives (0) or dies (1).
Regression Model
Let's divide our data into training and testing sets:
# Train and test split
key_factors = data[['age', 'serum_creatinine', 'ejection_fraction', 'time']]
# print(key_factors.head())
train = key_factors[:250].to_numpy()
train_labels = data['DEATH_EVENT'][:250]
test = key_factors[250:].to_numpy()
test_labels = data['DEATH_EVENT'][250:]
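One thing to be aware of: slicing off the first 250 rows only gives a fair split if the rows are not ordered by outcome. If you would rather not rely on that, Scikit-learn's train_test_split shuffles the data for you. This is an optional alternative, not part of the original packet; adjust test_size to match your row count:

from sklearn.model_selection import train_test_split

# Shuffled split; random_state makes it reproducible
train, test, train_labels, test_labels = train_test_split(
    key_factors.to_numpy(), data['DEATH_EVENT'],
    test_size=0.2, random_state=42
)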
Let us now create our logistic regression model, fit it to the training data, and evaluate it on the test set.
We will use the Scikit-learn library to make our model. The code looks something like this:
# Fitting to logistic regression model
regr = LogisticRegression()
regr.fit(train, train_labels)  # Fit training data
y_hat = regr.predict(test)  # Predict outcome on test set
print("Loss =", np.mean((y_hat - test_labels) ** 2))  # Print final loss
print("Accuracy =", regr.score(test, test_labels))  # Print accuracy of our model on test data
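Once the model is fitted, you can also query it for individual patients. The values below are made up purely for illustration; the feature order must match the training columns (age, serum_creatinine, ejection_fraction, time):

# Hypothetical patient: age 60, serum creatinine 1.2, ejection fraction 38, time 120
new_patient = np.array([[60, 1.2, 38, 120]])
print(regr.predict(new_patient))        # Predicted class: 0 (survives) or 1 (dies)
print(regr.predict_proba(new_patient))  # Probability of each class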
We obtain a very high test accuracy of about 95% and a loss of about 0.04, which is also very good. As you can see, logistic regression is really simple to implement as long as you know which factors to train the model on. It is also fast: just by writing some simple code, without carrying out any additional hyperparameter tuning, we were able to obtain a decent result.
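In a medical setting, accuracy alone can hide which kind of mistakes the model makes (missed deaths versus false alarms). If you want that breakdown, Scikit-learn's confusion_matrix provides it; this is an optional extra, not part of the original packet:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(test_labels, y_hat))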
The Python Script was coded and run on a Kaggle notebook.
Submitted by Praatibh Surana (praatibhsurana)