Handling missing data is a crucial step in the data preprocessing pipeline. If not handled properly, missing data can lead to inaccurate models and misleading results. Scikit-learn provides several methods to handle missing data efficiently.
In this tutorial, we will use the scikit-learn library to handle missing data in a dataset. We will use the California Housing dataset to demonstrate handling missing data.
Handling Missing Data in a Dataset
Step 1: Loading the Data
First, let’s import the necessary libraries and load the California Housing dataset.
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset
housing = fetch_california_housing(as_frame=True)
df = housing.frame
Let’s introduce some missing values into the dataset for demonstration purposes.
# Introduce missing values for demonstration
np.random.seed(42)
df.loc[np.random.choice(df.index, size=50, replace=False), 'MedInc'] = np.nan
Step 2: Identifying Missing Data
We will first identify the missing data in the dataset.
# Check for missing values
missing_data = df.isnull().sum()
print("Missing values in each column:\n", missing_data)

# Percentage of missing data
missing_percentage = df.isnull().mean() * 100
print("\nPercentage of missing data in each column:\n", missing_percentage)
Output:

Missing values in each column:
 MedInc         50
HouseAge        0
AveRooms        0
AveBedrms       0
Population      0
AveOccup        0
Latitude        0
Longitude       0
MedHouseVal     0
dtype: int64

Percentage of missing data in each column:
 MedInc         0.242248
HouseAge        0.000000
AveRooms        0.000000
AveBedrms       0.000000
Population      0.000000
AveOccup        0.000000
Latitude        0.000000
Longitude       0.000000
MedHouseVal     0.000000
dtype: float64
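Before choosing a strategy, it can also help to inspect the affected rows themselves rather than just the counts. The following is a small, optional sketch that continues from the df defined above; the variable names are only illustrative.

# Inspect a few of the rows where 'MedInc' is missing
rows_with_missing = df[df['MedInc'].isna()]
print(rows_with_missing.head())

# List every column that contains at least one missing value
cols_with_missing = df.columns[df.isnull().any()].tolist()
print("Columns with missing values:", cols_with_missing)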
Step 3: Handling Missing Data
There are different strategies to handle missing data. We will discuss two common methods: removing missing data and imputing missing data.
Removing Missing Data
Removing rows or columns with missing data is the simplest method but can lead to loss of valuable information.
# Remove rows with missing values
df_dropped = df.dropna()
print("\nShape of the dataset after removing rows with missing values:", df_dropped.shape)
Output:
Shape of the dataset after removing rows with missing values: (20590, 9)
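The snippet above drops rows. As noted, entire columns can also be dropped, which is usually reserved for columns that are mostly empty. Below is a small illustrative sketch; the 50% threshold is an arbitrary choice, and on this dataset no column crosses it, so nothing is actually removed.

# Drop any column where more than half of the values are missing
threshold = 0.5  # arbitrary cutoff, chosen only for illustration
cols_to_drop = df.columns[df.isnull().mean() > threshold]
df_cols_dropped = df.drop(columns=cols_to_drop)
print("Columns dropped:", list(cols_to_drop))
print("Shape after dropping sparse columns:", df_cols_dropped.shape)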
Imputing Missing Data
Imputing missing data involves replacing missing values with estimated values. We can use different imputation strategies such as mean, median, or most frequent value.
# Impute missing values using the mean strategy
imputer = SimpleImputer(strategy='mean')
df_imputed = df.copy()
df_imputed['MedInc'] = imputer.fit_transform(df[['MedInc']])

# Check if there are any remaining missing values
print("\nMissing values after imputation:\n", df_imputed.isnull().sum())
Output:
Missing values after imputation:
 MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64
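Mean imputation is only one of the strategies mentioned above. Swapping in 'median' (more robust to outliers) or 'most_frequent' (useful for categorical features) is just a change to the strategy parameter; the sketch below shows both, using the same column and illustrative variable names.

# Median imputation: more robust to outliers than the mean
median_imputer = SimpleImputer(strategy='median')
df_median = df.copy()
df_median['MedInc'] = median_imputer.fit_transform(df[['MedInc']])

# Most-frequent (mode) imputation: typically used for categorical features
mode_imputer = SimpleImputer(strategy='most_frequent')
df_mode = df.copy()
df_mode['MedInc'] = mode_imputer.fit_transform(df[['MedInc']])

print("Remaining NaNs (median, mode):",
      df_median['MedInc'].isnull().sum(),
      df_mode['MedInc'].isnull().sum())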
Step 4: Building and Evaluating a Model
We will build a simple linear regression model to predict the median house value (MedHouseVal). We will compare the performance of the model with and without handling missing data.
Without Handling Missing Data
# Drop rows with missing values
df_dropped = df.dropna()

# Split the data
X = df_dropped.drop(columns=['MedHouseVal'])
y = df_dropped['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build and evaluate the model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse_dropped = mean_squared_error(y_test, y_pred)
print("\nMSE without handling missing data:", mse_dropped)
Output:
MSE without handling missing data: 0.5575951281665453
With Imputed Missing Data
Build and evaluate a linear regression model with imputed missing data.
# Impute missing values
df_imputed['MedInc'] = imputer.fit_transform(df[['MedInc']])

# Split the data
X = df_imputed.drop(columns=['MedHouseVal'])
y = df_imputed['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build and evaluate the model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse_imputed = mean_squared_error(y_test, y_pred)
print("\nMSE with imputed missing data:", mse_imputed)
Output:
MSE with imputed missing data: 0.5525230362811387
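Note that the Pipeline class imported at the start of the tutorial can chain the imputer and the model, so the imputation statistics are learned from the training split only and then applied to the test split, avoiding leakage. A minimal sketch of that pattern, reusing the objects defined earlier:

# Split the raw data, which still contains NaNs in 'MedInc'
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Chain imputation and regression; the imputer is fit on the training data only
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', LinearRegression())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print("MSE with an imputation pipeline:", mean_squared_error(y_test, y_pred))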
Conclusion
In this tutorial, we covered how to handle missing data using scikit-learn. We demonstrated two common methods: removing rows with missing data and imputing missing values using the mean strategy. We then built and evaluated a simple linear regression model to see the impact of handling missing data on model performance. Proper handling of missing data is crucial for building accurate and reliable machine learning models.
Happy Coding!