ColumnTransformer in scikit-learn is a robust tool for applying different operations to different subsets of features in a dataset.
In this tutorial, we will learn how to use it.
Steps:
- Import necessary libraries
- Create sample dataset
- Define transformations
- Apply ColumnTransformer
Import necessary libraries
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
```
For the sake of simplicity, we will use the Iris dataset, which can be loaded directly from scikit-learn.
```python
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add a categorical column to demonstrate mixed transformations
df['species'] = iris.target_names[iris.target]

# Introduce some missing values for demonstration (an all-empty column
# would simply be dropped by SimpleImputer, so we blank out a few cells
# in existing columns instead)
df.loc[::10, 'sepal length (cm)'] = np.nan
df.loc[::15, 'species'] = None
```
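A quick look at the dtypes and missing-value counts confirms we now have a mix of numeric and categorical features:

```python
print(df.dtypes)        # four float64 measurement columns plus an object column
print(df.isna().sum())  # counts of the missing values we just introduced
```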
In this case:
- We have numerical columns for the four measurements: sepal length, sepal width, petal length, and petal width.
- We also have a categorical column, species.
Define transformations
For the numerical columns, we will impute any missing values with the column mean and then scale the features using StandardScaler.
```python
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Fill missing values with the column mean
    ('scaler', StandardScaler())                  # Standardize features to zero mean and unit variance
])
```
For the categorical column, we will impute missing values with the most frequent category and then one-hot encode the result.
```python
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing with the most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))     # One-hot encode categorical features
])
```
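Apply ColumnTransformer
Now we combine the two pipelines into a single `preprocessor`, mapping each one to the columns it should handle. The column lists below follow directly from the DataFrame we built above:

```python
numeric_features = iris.feature_names  # the four measurement columns
categorical_features = ['species']     # the categorical column we added

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
```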
Fit and transform the data
```python
# Learn the imputation/scaling/encoding parameters and transform the data in one call
processed_data = preprocessor.fit_transform(df)
```
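To see which output column corresponds to which transformation, you can wrap the result in a DataFrame using `get_feature_names_out` (available in recent scikit-learn versions; if the output comes back as a sparse matrix, convert it with `.toarray()` first):

```python
processed_df = pd.DataFrame(processed_data, columns=preprocessor.get_feature_names_out())
print(processed_df.head())
```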
Summary:
- Organizes preprocessing code for better clarity and maintainability.
- Reduces repetitive code by declaring all per-column transformations in one place.
- Imputes missing data systematically, which prevents errors when fitting models that cannot handle NaNs.
Following this pattern is good practice because it ensures preprocessing is applied consistently, which is especially important when dealing with large datasets.
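To make the consistency point concrete, here is a minimal sketch (the split parameters are arbitrary choices for illustration): the preprocessor learns its statistics from the training split only and then reuses them unchanged on unseen data, avoiding leakage.

```python
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

# Fit the imputers, scaler, and encoder on the training split only...
train_processed = preprocessor.fit_transform(train_df)

# ...then apply the exact same learned parameters to the test split
test_processed = preprocessor.transform(test_df)
```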