ColumnTransformer in Scikit-Learn for Data Preprocessing

ColumnTransformer in Scikit-Learn is a robust tool for applying different operations to different subsets of features in a dataset. In this tutorial we will learn how to use it.

Steps:

  • Import necessary libraries
  • Create sample dataset
  • Define transformations
  • Apply ColumnTransformer

Import necessary libraries

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

For the sake of simplicity, we will use the Iris dataset, which can be loaded directly from Scikit-Learn.

# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add a categorical column to demonstrate mixed transformations
df['species'] = iris.target_names[iris.target]

# Introduce a few missing values so the imputers have something to do
df.loc[::15, 'sepal length (cm)'] = np.nan
df.loc[::20, 'species'] = np.nan

In this case:

  • We have four numerical feature columns: sepal length, sepal width, petal length, and petal width.
  • We also have a categorical column, species (see the dtype check below).
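
A quick dtype check confirms the split between numeric and categorical columns:

print(df.dtypes)  # the four measurement columns are float64; species is object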
Define transformations

For the numerical columns, we impute any missing values with the column mean and then standardize the features using StandardScaler.

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Fill missing values with the mean
    ('scaler', StandardScaler())  # Scale features to standardize them
])

For the categorical column, we impute missing values with the most frequent category and then one-hot encode the result.

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing with the most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical features
])
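
Apply ColumnTransformer

With both pipelines defined, we combine them in a ColumnTransformer, which routes each pipeline to its own columns. A minimal sketch, assuming the four measurement columns are the numeric features and species is the only categorical one:

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, iris.feature_names),  # the four numeric columns
    ('cat', categorical_transformer, ['species'])      # the categorical column
])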

Fit and transform the data

processed_data = preprocessor.fit_transform(df)
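
The result is a plain array rather than a DataFrame (dense here, since the combined output exceeds ColumnTransformer's default sparse_threshold). To see which output column is which, we can wrap it back into a DataFrame; get_feature_names_out assumes scikit-learn 1.0 or newer:

feature_names = preprocessor.get_feature_names_out()
processed_df = pd.DataFrame(processed_data, columns=feature_names)
print(processed_df.head())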

Summary:

  • ColumnTransformer organizes preprocessing code for better clarity and maintainability.
  • It reduces repetitive code by routing each transformation to its own columns.
  • Imputation handles missing values up front, preventing errors during model fitting.

This pattern is good practice because it ensures the same transformations are applied consistently, which is especially important when working with large datasets or when preprocessing must be reproduced at prediction time.
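
As a final sketch, the same preprocessor can be dropped into a full Pipeline together with an estimator, so preprocessing and model fitting happen in a single step. Note that species is derived from the target in this toy dataset, so this illustrates the plumbing rather than a meaningful model:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df, iris.target, test_size=0.2, random_state=42
)

model = Pipeline(steps=[
    ('preprocessor', preprocessor),  # the ColumnTransformer defined above
    ('classifier', LogisticRegression(max_iter=200))
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))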
