Introduction
Missing values are a common problem in real-world datasets, and handling them appropriately is essential for building correct and reliable machine learning models. Most scikit-learn estimators cannot work with missing values in their input data, so the missing values must be dealt with before training. Scikit-learn provides several tools for exactly this purpose.
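Before deciding on a strategy, it helps to first quantify the missingness. A minimal sketch using pandas (the small DataFrame here is a made-up example):

import numpy as np
import pandas as pd

# Hypothetical DataFrame with a few missing entries
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 'x', 'y']})

# Count missing values per column
print(df.isna().sum())

# Fraction of rows containing at least one missing value
print(df.isna().any(axis=1).mean())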
Methods to Handle Missing Data
- Removing Data
- Imputation
Removing Data
In some cases, it is legitimate to simply remove data points or features that contain missing values. This is appropriate when the proportion of missing data is low enough that dropping it does not discard meaningful information. Finer-grained removal is also possible, as shown after the example below.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
data = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['target'])

# Introduce missing values for demonstration
data.iloc[0, 0] = np.nan
data.iloc[1, 2] = np.nan

# Remove rows with missing values
data_cleaned = data.dropna()

# If you want to remove columns with missing values
data_cleaned = data.dropna(axis=1)
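pandas' dropna also accepts subset and thresh parameters for finer control over what gets removed; a short sketch on the same data (the threshold value is illustrative):

# Drop rows only when specific columns are missing
data_cleaned = data.dropna(subset=['sepal length (cm)'])

# Keep rows that have at least 4 non-missing values
data_cleaned = data.dropna(thresh=4)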
Imputation
Imputation is a technique that replaces missing values with substitute values. Scikit-learn provides the SimpleImputer class for this purpose.
SimpleImputer
The SimpleImputer class can fill in missing values using several strategies: mean, median, most_frequent, or a constant value.
from sklearn.impute import SimpleImputer

# Define the imputer
imputer = SimpleImputer(strategy='mean')

# Fit and transform the data (excluding the target column)
data_imputed = imputer.fit_transform(data.iloc[:, :-1])

# Convert back to a DataFrame and include the target column
data_imputed = pd.DataFrame(data_imputed, columns=iris['feature_names'])
data_imputed['target'] = data['target']
print(data_imputed)
Output:
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0             5.848322               3.5           1.400000               0.2
1             4.900000               3.0           3.773826               0.2
2             4.700000               3.2           1.300000               0.2
3             4.600000               3.1           1.500000               0.2
4             5.000000               3.6           1.400000               0.2
..                 ...               ...                ...               ...
145           6.700000               3.0           5.200000               2.3
146           6.300000               2.5           5.000000               1.9
147           6.500000               3.0           5.200000               2.0
148           6.200000               3.4           5.400000               2.3
149           5.900000               3.0           5.100000               1.8

     target
0       0.0
1       0.0
2       0.0
3       0.0
4       0.0
..      ...
145     2.0
146     2.0
147     2.0
148     2.0
149     2.0

[150 rows x 5 columns]
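The other strategies work the same way; a short sketch of the median and constant options (the fill_value of 0 is arbitrary):

# Impute with the column median instead of the mean
median_imputer = SimpleImputer(strategy='median')
data_imputed_median = median_imputer.fit_transform(data.iloc[:, :-1])

# Impute with a fixed constant value
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
data_imputed_constant = constant_imputer.fit_transform(data.iloc[:, :-1])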
Imputation of Categorical Data
Missing values in categorical data can be substituted with the most frequent value or a constant. Here is a synthetic example:
# Example categorical data with missing values
categorical_data = pd.DataFrame({
    'animal': ['cat', 'dog', np.nan, 'dog', 'cat', np.nan],
    'color': ['white', 'black', 'white', np.nan, 'black', 'black']
})

# Define the imputer for categorical data
imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform the data
categorical_data_imputed = imputer.fit_transform(categorical_data)
print(categorical_data_imputed)
Output:
[['cat' 'white']
 ['dog' 'black']
 ['cat' 'white']
 ['dog' 'black']
 ['cat' 'black']
 ['cat' 'black']]
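Alternatively, missing categories can be flagged explicitly using the constant strategy; a brief sketch (the placeholder label 'unknown' is arbitrary):

# Replace missing categories with an explicit placeholder
constant_imputer = SimpleImputer(strategy='constant', fill_value='unknown')
print(constant_imputer.fit_transform(categorical_data))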
Advanced Imputation Techniques
Scikit-learn also provides more advanced imputation techniques. One of them is IterativeImputer, which models each feature with missing values as a function of the other features and iteratively refines its estimates. Note that IterativeImputer is still experimental and must be enabled explicitly via the enable_iterative_imputer import.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Introduce missing values for demonstration
data.iloc[2, 1] = np.nan
data.iloc[3, 3] = np.nan

# Define the iterative imputer
iterative_imputer = IterativeImputer()

# Fit and transform the data (excluding the target column)
data_imputed_iterative = iterative_imputer.fit_transform(data.iloc[:, :-1])

# Convert back to a DataFrame and include the target column
data_imputed_iterative = pd.DataFrame(data_imputed_iterative,
                                      columns=iris['feature_names'])
data_imputed_iterative['target'] = data['target']
print(data_imputed_iterative)
Output:
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0              5.00876          3.500000            1.40000          0.200000
1              4.90000          3.000000            1.66783          0.200000
2              4.70000          3.250361            1.30000          0.200000
3              4.60000          3.100000            1.50000          0.281159
4              5.00000          3.600000            1.40000          0.200000
..                 ...               ...                ...               ...
145            6.70000          3.000000            5.20000          2.300000
146            6.30000          2.500000            5.00000          1.900000
147            6.50000          3.000000            5.20000          2.000000
148            6.20000          3.400000            5.40000          2.300000
149            5.90000          3.000000            5.10000          1.800000

     target
0       0.0
1       0.0
2       0.0
3       0.0
4       0.0
..      ...
145     2.0
146     2.0
147     2.0
148     2.0
149     2.0

[150 rows x 5 columns]
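Another advanced option in scikit-learn is KNNImputer, which fills each missing value using the values of the most similar complete samples; a minimal sketch on the same data (n_neighbors=5 is the default, shown here for clarity):

from sklearn.impute import KNNImputer

# Impute each missing value from the 5 nearest samples
knn_imputer = KNNImputer(n_neighbors=5)
data_imputed_knn = knn_imputer.fit_transform(data.iloc[:, :-1])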
Conclusion
Handling missing data is an important step in the data preprocessing pipeline. The right strategy depends on how much data is missing and how it is distributed across the dataset. Scikit-learn provides both basic and advanced imputation techniques to ensure your machine learning models train on clean, complete data, and imputation fits naturally inside a scikit-learn Pipeline, as sketched below.
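Fitting the imputer inside a Pipeline keeps the imputation statistics restricted to the training split and avoids data leakage; a minimal sketch (the classifier choice is arbitrary):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = data.iloc[:, :-1], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The imputer learns its statistics from the training split only
model = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('clf', LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))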