Handling Missing Data using scikit-learn

Handling missing data is a crucial step in the data preprocessing pipeline. If not handled properly, missing data can lead to inaccurate models and misleading results. Scikit-learn provides several tools for handling missing data efficiently.

In this tutorial, we will use the scikit-learn library to handle missing data, demonstrating the techniques on the California Housing dataset.

Handling Missing Data in a Dataset

Step 1: Loading the Data

First, let’s import the necessary libraries and load the California Housing dataset.

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset
housing = fetch_california_housing(as_frame=True)
df = housing.frame

Let’s introduce some missing values into the dataset for demonstration purposes.

# Introduce missing values for demonstration
np.random.seed(42)
df.loc[np.random.choice(df.index, size=50, replace=False), 'MedInc'] = np.nan

Step 2: Identifying Missing Data

We will first identify the missing data in the dataset.

# Check for missing values
missing_data = df.isnull().sum()
print("Missing values in each column:\n", missing_data)

# Percentage of missing data
missing_percentage = df.isnull().mean() * 100
print("\nPercentage of missing data in each column:\n", missing_percentage)

Output:

Missing values in each column:
MedInc       50
HouseAge     0
AveRooms     0
AveBedrms    0
Population   0
AveOccup     0
Latitude     0
Longitude    0
MedHouseVal  0
dtype: int64

Percentage of missing data in each column:
MedInc        0.242248
HouseAge      0.000000
AveRooms      0.000000
AveBedrms     0.000000
Population    0.000000
AveOccup      0.000000
Latitude      0.000000
Longitude     0.000000
MedHouseVal   0.000000
dtype: float64
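
Beyond the counts, it is often useful to look at the affected rows themselves. A minimal sketch using the same DataFrame (in this example only MedInc contains missing values):

# Inspect a few of the rows that contain missing values
print("\nRows with missing MedInc:\n", df[df['MedInc'].isnull()].head())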

Step 3: Handling Missing Data

There are different strategies to handle missing data. We will discuss two common methods: removing missing data and imputing missing data.

Removing Missing Data

Removing rows or columns with missing data is the simplest method but can lead to loss of valuable information.

# Remove rows with missing values
df_dropped = df.dropna()
print("\nShape of the dataset after removing rows with missing values:", df_dropped.shape)

Output:

Shape of the dataset after removing rows with missing values: (20590, 9)
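
Since the missing values in this example are confined to MedInc, dropping whole columns or applying a row-wise threshold are other ways to remove missing data. A small sketch, assuming the same df as above (the threshold of 8 non-missing values is purely illustrative):

# Drop any column that contains missing values
df_cols_dropped = df.dropna(axis=1)
print("Shape after dropping columns with missing values:", df_cols_dropped.shape)

# Keep only rows that have at least 8 non-missing values
df_thresh = df.dropna(thresh=8)
print("Shape after threshold-based dropping:", df_thresh.shape)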

Imputing Missing Data

Imputing missing data involves replacing missing values with estimated values. We can use different imputation strategies such as mean, median, or most frequent value.

# Impute missing values using mean strategy
imputer = SimpleImputer(strategy='mean')
df_imputed = df.copy()
df_imputed['MedInc'] = imputer.fit_transform(df[['MedInc']])

# Check if there are any remaining missing values
print("\nMissing values after imputation:\n", df_imputed.isnull().sum())

Output:

Missing values after imputation:
MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64
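
The median and most-frequent strategies mentioned above work the same way; only the strategy argument changes. A brief sketch (the variable names here are illustrative):

# Median is more robust to outliers than the mean
median_imputer = SimpleImputer(strategy='median')
df_median = df.copy()
df_median['MedInc'] = median_imputer.fit_transform(df[['MedInc']])

# 'most_frequent' replaces missing values with the mode; typically used for categorical features
mode_imputer = SimpleImputer(strategy='most_frequent')
df_mode = df.copy()
df_mode['MedInc'] = mode_imputer.fit_transform(df[['MedInc']])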

Step 4: Building and Evaluating a Model

We will build a simple linear regression model to predict the median house value (MedHouseVal) and compare its performance when rows with missing values are dropped versus when the missing values are imputed.

Without Handling Missing Data

# Drop rows with missing values
df_dropped = df.dropna()

# Split the data
X = df_dropped.drop(columns=['MedHouseVal'])
y = df_dropped['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build and evaluate the model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse_dropped = mean_squared_error(y_test, y_pred)
print("\nMSE without handling missing data:", mse_dropped)

Output:

MSE without handling missing data: 0.5575951281665453

With Imputing Missing Data

Next, we build and evaluate the same linear regression model on the dataset with imputed values.

# Impute missing values (df_imputed was already imputed in Step 3; repeated here so this block is self-contained)
df_imputed['MedInc'] = imputer.fit_transform(df[['MedInc']])

# Split the data
X = df_imputed.drop(columns=['MedHouseVal'])
y = df_imputed['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build and evaluate the model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse_imputed = mean_squared_error(y_test, y_pred)
print("\nMSE with imputed missing data:", mse_imputed)

Output:

MSE with imputed missing data: 0.5525230362811387
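
Note that above, the imputer was fitted on the full dataset before the train/test split. The Pipeline class imported at the start can combine imputation and regression so that the imputer is fitted on the training data only, which avoids leaking test-set information into the imputation. A minimal sketch under the same setup (the variable name pipe is illustrative):

# Combine imputation and regression in a single Pipeline
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', LinearRegression())
])

# df still contains the NaNs we introduced in MedInc
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe.fit(X_train, y_train)
mse_pipeline = mean_squared_error(y_test, pipe.predict(X_test))
print("MSE with imputation inside a Pipeline:", mse_pipeline)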

Conclusion

In this tutorial, we covered how to handle missing data using scikit-learn. We demonstrated two common methods: removing rows with missing data and imputing missing values using the mean strategy. We then built and evaluated a simple linear regression model to see the impact of handling missing data on model performance. Proper handling of missing data is crucial for building accurate and reliable machine learning models.
Happy Coding!
