ColumnTransformer in Scikit-Learn for Data Preprocessing

The ColumnTransformer in Scikit-Learn is a powerful tool for applying different preprocessing steps to specific columns of your dataset. This is particularly useful when working with datasets that contain mixed data types (e.g., numerical and categorical features) and require separate preprocessing pipelines for different types of data.

Key Features
1. Apply Different Transformers to Different Columns: You can specify transformers (like StandardScaler, OneHotEncoder, etc.) for different subsets of columns.
2. Streamlined Workflow: It allows you to integrate preprocessing with Scikit-Learn’s Pipeline for a cleaner and more maintainable code structure.
3. Handles Mixed Data Types: Great for datasets with both numerical and categorical features.
4. Flexibility: You can pass custom transformers or built-in ones, and it supports dropping or passing through unprocessed columns.

code
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sample dataset
data = {
    'Age': [25, 32, 47, 51],
    'Salary': [50000, 60000, 120000, 80000],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'New York'],
    'Purchased': [0, 1, 1, 0]
}
df = pd.DataFrame(data)

# Features and target
X = df[['Age', 'Salary', 'City']]
y = df['Purchased']

# Define numeric and categorical columns
numeric_features = ['Age', 'Salary']
categorical_features = ['City']

# Create a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),  # Standardize numeric features
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)  # One-hot encode; ignore categories unseen during training
    ]
)

# Create a pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Predict on the test set
predictions = pipeline.predict(X_test)

print("Predictions:", predictions)
output
Predictions: [0]

How It Works
1. Numeric Transformation: StandardScaler standardizes the numeric features, rescaling each one to mean 0 and unit variance.
2. Categorical Transformation: OneHotEncoder converts categorical columns into dummy variables.
3. Integration: The ColumnTransformer ensures each transformation is applied to the right columns and concatenates the results into a single feature matrix (you can inspect this output as shown below).
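
To see exactly what the preprocessor produces, you can fit it on its own and wrap the result in a DataFrame. This is a minimal sketch that reuses the preprocessor and X defined above; get_feature_names_out() assumes a reasonably recent scikit-learn release (1.0 or later).

code
# Fit the preprocessor alone and inspect the transformed features
transformed = preprocessor.fit_transform(X)
feature_names = preprocessor.get_feature_names_out()

# Scaled numeric columns appear first, followed by the one-hot encoded city columns
print(pd.DataFrame(transformed, columns=feature_names))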

Additional Options
1. Dropping Columns: Set remainder='drop' (the default) to exclude any columns you did not list, or use the string 'drop' in place of a transformer to drop specific columns.
2. Passing Through Columns: Set remainder='passthrough' to keep unlisted columns in the output unchanged, or use 'passthrough' in place of a transformer.
3. Custom Transformers: You can define your own transformer by subclassing BaseEstimator and TransformerMixin (see the sketch below).
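
Here is a rough sketch of these options, continuing with the X from the example above. The LogTransformer class is a made-up illustration, not part of scikit-learn: it simply log-transforms whichever columns it is given.

code
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# A hypothetical custom transformer: applies log(1 + x) to the columns it receives
class LogTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))

# Scale 'Age', log-transform 'Salary', and pass the remaining column ('City') through untouched
preprocessor_alt = ColumnTransformer(
    transformers=[
        ('scale_age', StandardScaler(), ['Age']),
        ('log_salary', LogTransformer(), ['Salary'])
    ],
    remainder='passthrough'  # remainder='drop' (the default) would discard 'City' instead
)

print(preprocessor_alt.fit_transform(X))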

Advantages
• Streamlines preprocessing for complex datasets.
• Avoids manually slicing columns, transforming them separately, and recombining the results.
• Integrates seamlessly into Scikit-Learn pipelines.

Summary
The ColumnTransformer in Scikit-Learn allows you to apply different preprocessing steps to specific columns in your dataset, making it ideal for handling mixed data types. For example, you can scale numerical features using StandardScaler and encode categorical features with OneHotEncoder in a single unified workflow. It integrates seamlessly with Scikit-Learn’s Pipeline, enabling streamlined preprocessing and model training. This tool simplifies working with complex datasets by allowing you to drop or pass through columns and even use custom transformers, ensuring a clean and efficient machine learning workflow.
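
Because the preprocessing and the model live in one Pipeline object, the whole workflow can be evaluated or tuned as a single estimator. Below is a minimal sketch using cross_val_score on the pipeline and toy data from above; cv=2 is chosen only because the example dataset has four rows, and the handle_unknown='ignore' setting on the OneHotEncoder keeps cities unseen in a training fold from raising errors.

code
from sklearn.model_selection import cross_val_score

# Cross-validate preprocessing + model together; each fold is preprocessed
# using statistics learned from that fold's training data only
scores = cross_val_score(pipeline, X, y, cv=2)
print("Accuracy per fold:", scores)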
