Introduction
Missing values are a common problem in real-world datasets, and handling them appropriately is essential for building correct and reliable machine learning models. Most scikit-learn estimators cannot work with missing values in their input data, so the missing values must be dealt with before training. Scikit-learn provides several tools for exactly this purpose.
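Before deciding on a strategy, it helps to first quantify the missingness. A minimal sketch using pandas (the small DataFrame here is a made-up example):

import numpy as np
import pandas as pd

# Hypothetical DataFrame with a few missing entries
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 'x', 'y']})

# Count missing values per column
print(df.isna().sum())

# Fraction of rows containing at least one missing value
print(df.isna().any(axis=1).mean())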
Methods to Handle Missing Data
- Removing Data
- Imputation
Removing Data
In some cases, it is legitimate to simply remove data points or features that contain missing values. This is appropriate when the proportion of missing data is low enough that dropping it does not discard meaningful information. Finer-grained removal is also possible, as shown after the example below.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
data = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['target'])

# Introduce missing values for demonstration
data.iloc[0, 0] = np.nan
data.iloc[1, 2] = np.nan

# Remove rows with missing values
data_cleaned = data.dropna()

# If you want to remove columns with missing values
data_cleaned = data.dropna(axis=1)
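pandas' dropna also accepts subset and thresh parameters for finer control over what gets removed; a short sketch on the same data (the threshold value is illustrative):

# Drop rows only when specific columns are missing
data_cleaned = data.dropna(subset=['sepal length (cm)'])

# Keep rows that have at least 4 non-missing values
data_cleaned = data.dropna(thresh=4)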
Imputation
Imputation is a technique that replaces missing values with substitute values. Scikit-learn provides the SimpleImputer class for this purpose.
SimpleImputer
The SimpleImputer class can fill in missing values using several strategies: mean, median, most_frequent, or a constant value.
from sklearn.impute import SimpleImputer

# Define the imputer
imputer = SimpleImputer(strategy='mean')

# Fit and transform the data (excluding the target column)
data_imputed = imputer.fit_transform(data.iloc[:, :-1])

# Convert back to a DataFrame and include the target column
data_imputed = pd.DataFrame(data_imputed, columns=iris['feature_names'])
data_imputed['target'] = data['target']
print(data_imputed)
Output:
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0             5.848322               3.5           1.400000               0.2
1             4.900000               3.0           3.773826               0.2
2             4.700000               3.2           1.300000               0.2
3             4.600000               3.1           1.500000               0.2
4             5.000000               3.6           1.400000               0.2
..                 ...               ...                ...               ...
145           6.700000               3.0           5.200000               2.3
146           6.300000               2.5           5.000000               1.9
147           6.500000               3.0           5.200000               2.0
148           6.200000               3.4           5.400000               2.3
149           5.900000               3.0           5.100000               1.8

     target
0       0.0
1       0.0
2       0.0
3       0.0
4       0.0
..      ...
145     2.0
146     2.0
147     2.0
148     2.0
149     2.0

[150 rows x 5 columns]
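The other strategies work the same way; a short sketch of the median and constant options (the fill_value of 0 is arbitrary):

# Impute with the column median instead of the mean
median_imputer = SimpleImputer(strategy='median')
data_imputed_median = median_imputer.fit_transform(data.iloc[:, :-1])

# Impute with a fixed constant value
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
data_imputed_constant = constant_imputer.fit_transform(data.iloc[:, :-1])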
Imputation of Categorical Data
Missing values in categorical data can be substituted with the most frequent value or a constant. Here is a synthetic example:
# Example categorical data with missing values
categorical_data = pd.DataFrame({
    'animal': ['cat', 'dog', np.nan, 'dog', 'cat', np.nan],
    'color': ['white', 'black', 'white', np.nan, 'black', 'black']
})

# Define the imputer for categorical data
imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform the data
categorical_data_imputed = imputer.fit_transform(categorical_data)
print(categorical_data_imputed)
Output:
[['cat' 'white']
 ['dog' 'black']
 ['cat' 'white']
 ['dog' 'black']
 ['cat' 'black']
 ['cat' 'black']]
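Alternatively, missing categories can be flagged explicitly using the constant strategy; a brief sketch (the placeholder label 'unknown' is arbitrary):

# Replace missing categories with an explicit placeholder
constant_imputer = SimpleImputer(strategy='constant', fill_value='unknown')
print(constant_imputer.fit_transform(categorical_data))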
Advanced Imputation Techniques
Scikit-learn also provides more advanced imputation techniques. One of them is IterativeImputer, which models each feature with missing values as a function of the other features and iteratively refines its estimates. Note that IterativeImputer is still experimental and must be enabled explicitly via the enable_iterative_imputer import.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Introduce missing values for demonstration
data.iloc[2, 1] = np.nan
data.iloc[3, 3] = np.nan

# Define the iterative imputer
iterative_imputer = IterativeImputer()

# Fit and transform the data (excluding the target column)
data_imputed_iterative = iterative_imputer.fit_transform(data.iloc[:, :-1])

# Convert back to a DataFrame and include the target column
data_imputed_iterative = pd.DataFrame(data_imputed_iterative,
                                      columns=iris['feature_names'])
data_imputed_iterative['target'] = data['target']
print(data_imputed_iterative)
Output:
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0              5.00876          3.500000            1.40000          0.200000
1              4.90000          3.000000            1.66783          0.200000
2              4.70000          3.250361            1.30000          0.200000
3              4.60000          3.100000            1.50000          0.281159
4              5.00000          3.600000            1.40000          0.200000
..                 ...               ...                ...               ...
145            6.70000          3.000000            5.20000          2.300000
146            6.30000          2.500000            5.00000          1.900000
147            6.50000          3.000000            5.20000          2.000000
148            6.20000          3.400000            5.40000          2.300000
149            5.90000          3.000000            5.10000          1.800000

     target
0       0.0
1       0.0
2       0.0
3       0.0
4       0.0
..      ...
145     2.0
146     2.0
147     2.0
148     2.0
149     2.0

[150 rows x 5 columns]
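Another advanced option in scikit-learn is KNNImputer, which fills each missing value using the values of the most similar complete samples; a minimal sketch on the same data (n_neighbors=5 is the default, shown here for clarity):

from sklearn.impute import KNNImputer

# Impute each missing value from the 5 nearest samples
knn_imputer = KNNImputer(n_neighbors=5)
data_imputed_knn = knn_imputer.fit_transform(data.iloc[:, :-1])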
Conclusion
Handling missing data is an important step in the data preprocessing pipeline. The right strategy depends on how much data is missing and how it is distributed across the dataset. Scikit-learn provides both basic and advanced imputation techniques to ensure your machine learning models train on clean, complete data, and imputation fits naturally inside a scikit-learn Pipeline, as sketched below.
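Fitting the imputer inside a Pipeline keeps the imputation statistics restricted to the training split and avoids data leakage; a minimal sketch (the classifier choice is arbitrary):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = data.iloc[:, :-1], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The imputer learns its statistics from the training split only
model = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('clf', LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))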