ColumnTransformer in scikit-learn is a robust tool for applying different operations to different subsets of features in a dataset.
In this tutorial, we will learn how to use it.
Steps:
- Import necessary libraries
- Create sample dataset
- Define transformations
- Apply ColumnTransformer
Import necessary libraries
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
```
For the sake of simplicity, we will use the Iris dataset, which can be loaded directly from scikit-learn.
```python
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add a categorical column to demonstrate mixed transformations
df['species'] = iris.target_names[iris.target]

# Introduce some missing values for demonstration (an all-empty column
# would simply be dropped by SimpleImputer, so we blank out a few cells
# in existing columns instead)
df.loc[::10, 'sepal length (cm)'] = np.nan
df.loc[::15, 'species'] = None
```
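A quick look at the dtypes and missing-value counts confirms we now have a mix of numeric and categorical features:

```python
print(df.dtypes)        # four float64 measurement columns plus an object column
print(df.isna().sum())  # counts of the missing values we just introduced
```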
In this case:
- We have numerical columns for the four measurements: sepal length, sepal width, petal length, and petal width.
- We also have a categorical column, species.
Define transformations
For the numerical columns, we will impute any missing values with the column mean and then scale the features using StandardScaler.
```python
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Fill missing values with the column mean
    ('scaler', StandardScaler())                  # Standardize features to zero mean and unit variance
])
```
For the categorical column, we will impute missing values with the most frequent category and then one-hot encode the result.
```python
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing with the most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))     # One-hot encode categorical features
])
```
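Apply ColumnTransformer
Now we combine the two pipelines into a single `preprocessor`, mapping each one to the columns it should handle. The column lists below follow directly from the DataFrame we built above:

```python
numeric_features = iris.feature_names  # the four measurement columns
categorical_features = ['species']     # the categorical column we added

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
```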
Fit and transform the data
```python
# Learn the imputation/scaling/encoding parameters and transform the data in one call
processed_data = preprocessor.fit_transform(df)
```
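To see which output column corresponds to which transformation, you can wrap the result in a DataFrame using `get_feature_names_out` (available in recent scikit-learn versions; if the output comes back as a sparse matrix, convert it with `.toarray()` first):

```python
processed_df = pd.DataFrame(processed_data, columns=preprocessor.get_feature_names_out())
print(processed_df.head())
```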
Summary:
- Organizes preprocessing code for better clarity and maintainability.
- Reduces repetitive code by declaring all per-column transformations in one place.
- Imputes missing data systematically, which prevents errors when fitting models that cannot handle NaNs.
Following this pattern is good practice because it ensures preprocessing is applied consistently, which is especially important when dealing with large datasets.
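To make the consistency point concrete, here is a minimal sketch (the split parameters are arbitrary choices for illustration): the preprocessor learns its statistics from the training split only and then reuses them unchanged on unseen data, avoiding leakage.

```python
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

# Fit the imputers, scaler, and encoder on the training split only...
train_processed = preprocessor.fit_transform(train_df)

# ...then apply the exact same learned parameters to the test split
test_processed = preprocessor.transform(test_df)
```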