The ColumnTransformer in Scikit-Learn is a powerful tool for applying different preprocessing steps to specific columns of your dataset. This is particularly useful for datasets with mixed data types (e.g., numerical and categorical features) that require a separate preprocessing pipeline for each.
Key Features
1. Apply Different Transformers to Different Columns: You can specify transformers (such as StandardScaler or OneHotEncoder) for different subsets of columns.
2. Streamlined Workflow: It allows you to integrate preprocessing with Scikit-Learn’s Pipeline for a cleaner and more maintainable code structure.
3. Handles Mixed Data Types: Great for datasets with both numerical and categorical features.
4. Flexibility: You can pass custom transformers or built-in ones, and it supports dropping or passing through unprocessed columns.
code
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sample dataset
data = {
    'Age': [25, 32, 47, 51],
    'Salary': [50000, 60000, 120000, 80000],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'New York'],
    'Purchased': [0, 1, 1, 0]
}
df = pd.DataFrame(data)

# Features and target
X = df[['Age', 'Salary', 'City']]
y = df['Purchased']

# Define numeric and categorical columns
numeric_features = ['Age', 'Salary']
categorical_features = ['City']

# Create a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),    # Standardize numeric features
        ('cat', OneHotEncoder(), categorical_features)  # One-hot encode categorical features
    ]
)

# Create a pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Predict on the test set
predictions = pipeline.predict(X_test)
print("Predictions:", predictions)
output
Predictions: [0]
How It Works
1. Numeric Transformation: StandardScaler standardizes the numeric features, scaling each to mean 0 and unit variance.
2. Categorical Transformation: OneHotEncoder converts categorical columns into dummy variables.
3. Integration: The ColumnTransformer ensures each transformation is applied only to its designated columns, as the snippet below shows.
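To see what these transformations actually produce, you can fit the ColumnTransformer on its own and inspect the resulting feature matrix. The following is a minimal sketch, assuming the preprocessor and X defined in the example above and a reasonably recent scikit-learn release (for get_feature_names_out):
code
from sklearn.base import clone

# Fit a copy of the preprocessor so the fitted pipeline above is untouched.
inspector = clone(preprocessor)
transformed = inspector.fit_transform(X)

# Output column names: scaled numeric features followed by one
# one-hot column per city seen during fitting.
print(inspector.get_feature_names_out())
print(transformed)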
Additional Options
1. Dropping Columns: Pass 'drop' as the transformer for specific columns, or rely on remainder='drop' (the default) to discard any columns you did not list.
2. Passing Through Columns: Pass 'passthrough' as the transformer, or set remainder='passthrough', to keep columns without processing.
3. Custom Transformers: You can define your own transformer by subclassing TransformerMixin, as sketched below.
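Here is a minimal sketch of these options, reusing the Age/Salary/City columns from the example above; LogTransformer is a hypothetical custom transformer shown only for illustration:
code
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical custom transformer: applies log(1 + x) to its columns.
class LogTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        return np.log1p(X)

# 'Salary' gets the custom log transform and 'Age' is standardized;
# remainder='passthrough' keeps the unlisted 'City' column unchanged,
# while remainder='drop' (the default) would discard it instead.
flexible_preprocessor = ColumnTransformer(
    transformers=[
        ('log_salary', LogTransformer(), ['Salary']),
        ('scale_age', StandardScaler(), ['Age']),
    ],
    remainder='passthrough'
)

# flexible_preprocessor.fit_transform(X) would then return the log-scaled
# salary, the standardized age, and the raw city strings side by side.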
Advantages
• Streamlines preprocessing for complex datasets.
• Avoids manual preprocessing steps.
• Integrates seamlessly into Scikit-Learn pipelines.
Summary
The ColumnTransformer in Scikit-Learn allows you to apply different preprocessing steps to specific columns in your dataset, making it ideal for handling mixed data types. For example, you can scale numerical features using StandardScaler and encode categorical features with OneHotEncoder in a single unified workflow. It integrates seamlessly with Scikit-Learn’s Pipeline, enabling streamlined preprocessing and model training. This tool simplifies working with complex datasets by allowing you to drop or pass through columns and even use custom transformers, ensuring a clean and efficient machine learning workflow.