Multi-Output Regression using Scikit-learn

Multi-output regression is a type of regression analysis in which multiple target variables are predicted simultaneously. It is useful when you want to predict several related outcomes from the same set of input features. In this tutorial, we’ll explore how to implement and optimize a multi-output regression model using Scikit-learn.

Step 1: Import Libraries and Load Data

We’ll start by importing the necessary libraries and loading a dataset suitable for multi-output regression.

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

For simplicity, we’ll generate a synthetic dataset with Scikit-learn’s make_regression function.

# Create a synthetic dataset for multi-output regression
X, y = make_regression(n_samples=200, n_features=10, n_targets=3, noise=0.1, random_state=42)

# Convert to DataFrame for easier manipulation
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
y = pd.DataFrame(y, columns=[f'target_{i}' for i in range(y.shape[1])])

Step 2: Explore the Data

Let’s take a quick look at the data to understand its structure.

# Descriptive statistics for features
print("Descriptive statistics of the features:")
print(X.describe())

# Descriptive statistics for targets
print("\nDescriptive statistics of the targets:")
print(y.describe())

Output:

Descriptive statistics of the features:

      feature_0   feature_1   feature_2   feature_3   feature_4   feature_5   feature_6   feature_7   feature_8   feature_9
count 200.000000  200.000000  200.000000  200.000000  200.000000  200.000000  200.000000  200.000000  200.000000  200.000000
mean  0.002352    0.022755    0.003835    0.034182    0.005960    0.039495    0.010206    -0.045157   0.012827    0.033108
std   1.015624    0.987032    1.005582    0.965866    1.003157    1.007547    0.962328    0.975496    0.970484    1.018156
min   -2.721914   -2.517678   -2.539479   -2.382374   -2.522725   -2.770594   -2.729664   -2.592636   -2.798728   -2.881425
25%   -0.727874   -0.595448   -0.686160   -0.611953   -0.654137   -0.609942   -0.659542   -0.726058   -0.689139   -0.601338
50%   0.038284    0.035900    0.015700    0.070594    0.019228    0.062201    0.026836    -0.038230   0.054936    0.059469
75%   0.715494    0.700083    0.675392    0.693450    0.720345    0.674857    0.677657     0.630337   0.683173    0.715103
max   3.103694    2.606345    2.668710    2.859170    2.960878    2.946066    2.689672     2.693725   2.480960    2.992877

Descriptive statistics of the targets:
      target_0     target_1    target_2
count 200.000000  200.000000   200.000000
mean  -0.037967   -0.023492    0.040885
std   214.982035  101.451349   138.444276
min   -401.188729 -200.486015  -268.142680
25%   -158.401141 -73.178413   -97.665157
50%   -1.733486   -0.639736    2.732703
75%   154.014206  62.072922    108.171837
max   404.263448  212.599406   282.476989
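
Before modeling, it can also help to check how strongly the targets are correlated with one another. This is an optional check beyond the tutorial’s output, using pandas’ corr():

# Optional: pairwise correlation between the three targets
print("\nCorrelation between targets:")
print(y.corr())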

Step 3: Prepare the Data

We’ll split the dataset into training and testing sets.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

Output:

Training set size: 160
Test set size: 40

Step 4: Build a Pipeline

We’ll create a pipeline that standardizes the data and trains a multi-output regression model using Ridge regression.

# Creating a pipeline with StandardScaler and MultiOutputRegressor
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', MultiOutputRegressor(Ridge()))
])
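
A note on this wrapper: MultiOutputRegressor fits one independent Ridge model per target. Ridge itself actually supports multiple targets natively, so the simpler variant below would also work; this is an alternative sketch, not part of the tutorial’s main flow. The wrapper is used above because the same pattern generalizes to estimators that only handle a single target, such as SVR.

# Alternative: Ridge handles multiple targets natively, so the wrapper is optional here
native_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', Ridge())
])

If you use this variant, the hyperparameter name in Step 5 becomes 'regressor__alpha' instead of 'regressor__estimator__alpha'.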

Step 5: Define Hyperparameters

We will define a grid of hyperparameters to optimize.

# Hyperparameter grid for Ridge regression
param_grid = {
    'regressor__estimator__alpha': [0.1, 1.0, 10.0]
}
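
The double underscores in the parameter name walk through the nesting: the 'regressor' pipeline step wraps a Ridge as its 'estimator', and alpha is the Ridge parameter we want to tune. If you are unsure of the exact name, you can list the pipeline’s tunable parameters; this quick check is an optional addition:

# Optional: list the tunable parameter names that mention 'alpha'
for name in pipeline.get_params().keys():
    if 'alpha' in name:
        print(name)  # prints: regressor__estimator__alpha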

Step 6: Perform Grid Search

Using GridSearchCV, we will find the best combination of hyperparameters.

# Performing grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
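
If you want to see how each value of alpha performed, the fitted search object exposes the full cross-validation results via cv_results_. This inspection step is an optional addition:

# Optional: mean cross-validation score for each alpha tried
# (scores are negative MSE because of the chosen scoring function)
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results[['param_regressor__estimator__alpha', 'mean_test_score']])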

Step 7: Evaluate the Model

We evaluate the model using mean squared error (MSE) and the R-squared (R²) score. Note that because GridSearchCV was given scoring='neg_mean_squared_error', the best cross-validation score is reported as a negative MSE.

# Best parameters found by GridSearchCV
print("Best parameters found:", grid_search.best_params_)

# Best score achieved during cross-validation
print("Best cross-validation score:", grid_search.best_score_)

# Evaluate on the test set
y_pred = grid_search.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Test set mean squared error: {mse:.2f}")
print(f"Test set R² score: {r2:.2f}")

Output:

Best parameters found: {'regressor__estimator__alpha': 1.0}
Best cross-validation score: -10180.35
Test set mean squared error: 11094.36
Test set R² score: 0.89
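
By default, mean_squared_error and r2_score average the results across the three targets. To see how the model performs on each target individually, both functions accept multioutput='raw_values'; this per-target breakdown is an optional addition beyond the tutorial’s output:

# Optional: per-target metrics instead of the averaged values above
mse_per_target = mean_squared_error(y_test, y_pred, multioutput='raw_values')
r2_per_target = r2_score(y_test, y_pred, multioutput='raw_values')
for i, (m, r) in enumerate(zip(mse_per_target, r2_per_target)):
    print(f"target_{i}: MSE={m:.2f}, R²={r:.2f}")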

Step 8: Fine-Tuning and Improvements

If the model’s performance is not satisfactory, consider experimenting with different algorithms, more complex pipelines, or additional features. Here are some suggestions:

  • Try different regression models such as RandomForestRegressor, GradientBoostingRegressor, or SVR (see the sketch after this list).
  • Use feature selection techniques to identify the most important features.
  • Experiment with ensemble methods to combine the predictions of multiple models.
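
As one example, here is a minimal sketch of swapping in RandomForestRegressor. Random forests support multiple targets natively, so no MultiOutputRegressor wrapper is needed; the hyperparameter values below are illustrative defaults, not tuned choices.

# Sketch: a random forest handles all three targets directly
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print("Random forest test R² score:", r2_score(y_test, rf.predict(X_test)))

Tree-based models also do not require feature scaling, so the StandardScaler step can be dropped here.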

Conclusion

We have walked through the process of building and optimizing a multi-output regression model using Scikit-learn. This approach allows you to efficiently predict multiple related outcomes from the same set of input features. With the flexibility of Scikit-learn, you can easily extend this workflow to more complex scenarios and datasets.

Happy coding!
