
Logistic Regression in Python using Scikit-Learn

By Praatibh Surana

In this project, we will create a logistic regression model to predict whether or not a patient’s heart failure is fatal.

Logistic Regression is one of the most fundamental algorithms for binary classification. It is very easy to implement in Python and works well when the classes are (roughly) linearly separable. It is also fast and computationally cheap, making it popular not only among students but also among researchers and data scientists.
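
Under the hood, the model passes a weighted sum of the input features through the sigmoid (logistic) function, which squashes it into a probability between 0 and 1; thresholding that probability (typically at 0.5) gives the predicted class. Here is a minimal sketch of that idea:

import numpy as np

def sigmoid(z):
    # Maps any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

z = 2.0                  # Example weighted sum of a patient's features
p = sigmoid(z)           # ~0.88, read as an 88% probability of class 1
print(p, int(p > 0.5))   # The probability and the resulting class prediction (1)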

Thanks to the Scikit-learn library, creating your own logistic regression model in Python takes only a few lines of code. So without further delay, let's get right into it!

Dataset

The dataset we are going to work with is available at this Google Drive link. We will look into the contents a little later.


Installing Dependencies

Before we start working with our data, let us look at the dependencies for this project. We are going to be working with Scikit-learn and will also require NumPy and Pandas to handle the data. Additionally, we will make use of Seaborn, a data visualization library built on top of Matplotlib, along with Matplotlib itself. These will help us determine the crucial factors that affect our prediction simply by looking at the generated graphs and heatmaps.

To install these, go to your command window/terminal and type the following commands:

pip install scikit-learn
pip install numpy
pip install pandas
pip install seaborn
pip install matplotlib

There! We now have all the dependencies required to execute our code. Let us now start by importing all the libraries we need.

Importing Libraries

import pandas as pd                                  # Data handling with DataFrames
import numpy as np                                   # Numerical operations
import seaborn as sns                                # Statistical data visualization
from sklearn.linear_model import LogisticRegression  # The classifier we will train
import matplotlib.pyplot as plt                      # Plotting

Next, we need to take a look at the contents of our dataset.

Analyzing the Data

Luckily, this dataset has no missing values or junk entries in it; the data preprocessing has already been taken care of. We can thus move straight on to reading the data and looking at its attributes. Use the following code snippet to read the data:

data = pd.read_csv('path/to/data')  # Replace 'path/to/data' with the actual path to the CSV file
data.head(5)  # View the first 5 rows
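
If you want to verify the no-missing-values claim yourself, a quick sanity check looks like the following (a sketch; it assumes the data has been loaded as above):

data.info()                 # Column names, dtypes and non-null counts
print(data.isnull().sum())  # Should report 0 missing values per column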

You can view the data from the folder attached to this packet.

Data Visualization

Next, let us take a look at the various attributes and how they relate to one another. This relationship is captured by a correlation matrix, which we will visualize as a heatmap using the Seaborn library. Use the following code snippet to view the correlation matrix heatmap:

# Correlation matrix
# Seeing how each of the parameters is related

corrmat = data.corr()  # Pairwise correlations between all numeric columns
f, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(corrmat, vmax=.8, square=True)  # Each cell colors the correlation between a pair of features
plt.show()  # Display the figure (needed when running as a plain script)

The heatmap looks something like this:

[Figure: heatmap representing correlations between the features of the patients]

From what we see on the heatmap, the attributes that correlate most strongly with the death of a patient are the following (we confirm this numerically right after the list):

1) Age

2) Serum Creatinine

3) Ejection Fraction

4) Time
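
If you prefer numbers to colors, you can rank each feature by the absolute value of its correlation with DEATH_EVENT (a short sketch reusing the corrmat computed above):

# Features sorted by the strength of their correlation with the target
print(corrmat['DEATH_EVENT'].abs().sort_values(ascending=False))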

Based on this observation, let us now build a regression model that takes these 4 factors as input and predicts the DEATH_EVENT attribute, i.e., whether or not the patient will die, as a 0 or 1 output.


Regression Model

Let's divide our data into training and testing sets:

# Train and test split

key_factors = data[['age','serum_creatinine','ejection_fraction','time']]  # The 4 features identified above
#print (key_factors.head())
train = key_factors[:250].to_numpy()    # First 250 rows for training
train_labels = data['DEATH_EVENT'][:250]
test = key_factors[250:].to_numpy()     # Remaining rows for testing
test_labels = data['DEATH_EVENT'][250:]
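
Note that this is a simple positional split: the first 250 rows train the model and the remaining rows test it. If you would rather have a shuffled, reproducible split, Scikit-learn's train_test_split is a common alternative (a sketch; the random_state value is arbitrary):

from sklearn.model_selection import train_test_split

# Roughly 80/20 shuffled split of the same features and labels
train, test, train_labels, test_labels = train_test_split(
    key_factors.to_numpy(), data['DEATH_EVENT'],
    test_size=0.2, random_state=42)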


Let us now create our logistic regression model, fit it on the training data, and then evaluate it on the testing data.

We will use the Scikit-learn library to make our model. The code looks something like this:

# Fitting to logistic regression model

regr = LogisticRegression()  # Default settings; raise max_iter if you see a convergence warning
regr.fit(train, train_labels)  # Fit the model on the training data
y_hat = regr.predict(test)  # Predict outcomes on the test set

# For 0/1 labels, the mean squared error equals the fraction of misclassified samples
print("Loss =", np.mean((y_hat - test_labels) ** 2))
print("Accuracy =", regr.score(test, test_labels))  # Mean accuracy on the test data


We obtain a very high validation accuracy of about 95% on our test set and a loss of about 0.04, which is also very good. As you can see, this is really simple to implement as long as you know which factors to train the model on. Predictions also take very little time: just by writing some simple code, without any additional hyperparameter tuning, we were able to obtain a decent result.
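
Once trained, the model can also score a brand-new patient. The feature values below are purely hypothetical, just to illustrate the call; the order must match the training columns (age, serum_creatinine, ejection_fraction, time):

# Hypothetical patient: age 60, serum creatinine 1.1, ejection fraction 38, time 120
new_patient = np.array([[60, 1.1, 38, 120]])
print(regr.predict(new_patient))        # Predicted class: 0 (survives) or 1 (death event)
print(regr.predict_proba(new_patient))  # Probabilities for each class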

The Python script was coded and run in a Kaggle notebook.
