Optimizing a Modeling Pipeline with scikit-learn

Hello! We’re going to explore how to optimize a modeling pipeline using scikit-learn, one of the most popular machine-learning libraries in Python. Optimizing your pipeline can noticeably improve model performance by automating hyperparameter selection and keeping preprocessing and training together in one reproducible workflow.

We will be using the famous Iris dataset, which contains measurements of iris flowers, to demonstrate the optimization process. By the end of this tutorial, you will know how to build and optimize a machine-learning pipeline using scikit-learn. Note that ‘scikit-learn’ and ‘sklearn’ refer to the same library; sklearn is simply the name used in import statements.

Building an Optimized Modeling Pipeline with scikit-learn

Step 1: Setting the stage

Let’s begin by importing the necessary libraries and loading the Iris dataset. We’ll use pandas for data handling and scikit-learn to build and optimize our model.

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

Next, we load the Iris dataset and convert it into a pandas DataFrame.

# Load the dataset
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['species'])
iris_df['species'] = iris_df['species'].astype(int)
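
As a quick, optional sanity check, you can preview the first few rows to confirm the columns and the integer species labels before moving on:

# Quick sanity check: preview the first rows and confirm the shape
print(iris_df.head())
print("Shape:", iris_df.shape)  # 150 rows, 4 feature columns plus 'species'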
Step 2: Exploring the Data

Next, we’ll use descriptive statistics to understand the dataset better.

# Descriptive statistics
print("Descriptive statistics of the features:")
print(iris_df.describe())

# Checking for class balance
print("\nClass distribution:")
print(iris_df['species'].value_counts())

Output:

Descriptive statistics of the features:
      sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count       150.000000       150.000000        150.000000       150.000000
mean          5.843333         3.057333          3.758000         1.199333
std           0.828066         0.435866          1.765298         0.762238
min           4.300000         2.000000          1.000000         0.100000
25%           5.100000         2.800000          1.600000         0.300000
50%           5.800000         3.000000          4.350000         1.300000
75%           6.400000         3.300000          5.100000         1.800000
max           7.900000         4.400000          6.900000         2.500000

Class distribution:
0    50
1    50
2    50
Name: species, dtype: int64
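
To dig a little deeper, a per-class summary shows how strongly the petal measurements separate the three species. This is an optional sketch using only pandas:

# Mean of each feature per species, to see how the classes differ
print(iris_df.groupby('species').mean())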
Step 3: Preparing the Data

We will split the dataset into training and testing sets.

# Features and target variable
X = iris_df.drop(columns=['species'])
y = iris_df['species']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

Output:

Training set size: 120
Test set size: 30
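
The Iris classes are perfectly balanced, so a plain random split works well here. For imbalanced datasets you would typically pass stratify=y so that both splits preserve the class proportions; a minimal variation on the split above:

# Stratified split: keeps the 50/50/50 class ratio in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)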
Step 4: Building a Pipeline

We will create a pipeline that standardizes the data and trains an SVM classifier.

# Creating a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC())
])
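
The pipeline behaves like a single estimator: fit standardizes the training data and trains the SVM in one call, and predict and score apply the same scaling automatically. As a quick optional baseline before any tuning:

# Fit the un-tuned pipeline and check its accuracy on the test set
pipeline.fit(X_train, y_train)
print("Baseline test accuracy:", pipeline.score(X_test, y_test))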
Step 5: Defining Hyperparameters

We will define a grid of hyperparameters to optimize.

# Hyperparameter grid
param_grid = {
    'svc__kernel': ['linear', 'rbf'],
    'svc__C': [0.1, 1, 10],
    'svc__gamma': [0.1, 0.01, 0.001]
}
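
The svc__ prefix follows scikit-learn’s <step name>__<parameter> convention for reaching parameters inside a pipeline step. If you are unsure of the available names, you can list them from the pipeline itself (a quick check, assuming the pipeline defined in Step 4):

# List every tunable parameter name exposed by the pipeline
print(sorted(pipeline.get_params().keys()))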
Step 6: Performing Grid Search

Using GridSearchCV, we will find the best combination of hyperparameters.

# Performing grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
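
With cv=5, GridSearchCV fits the pipeline five times for each of the 2 × 3 × 3 = 18 parameter combinations (90 fits in total) and stores the full results in cv_results_, which is convenient to inspect as a DataFrame. A small optional sketch:

# Inspect the cross-validation results, sorted by mean test accuracy
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())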
Step 7: Evaluating the Model

Finally, we evaluate the tuned model on the test set using accuracy, precision, recall, F1-score, and a confusion matrix.

# Best parameters found by GridSearchCV
print("Best parameters found:", grid_search.best_params_)

# Best score achieved during cross-validation
print("Best cross-validation score:", grid_search.best_score_)

# Evaluate on the test set
y_pred = grid_search.predict(X_test)
test_score = grid_search.score(X_test, y_test)
print("Test set accuracy:", test_score)

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Output:

Best parameters found: {'svc__C': 1, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}
Best cross-validation score: 0.975
Test set accuracy: 1.0

Classification Report:
              precision    recall    f1-score    support

      setosa       1.00      1.00        1.00         10
  versicolor       1.00      1.00        1.00          9
   virginica       1.00      1.00        1.00         11

    accuracy                             1.00         30
   macro avg       1.00      1.00        1.00         30
weighted avg       1.00      1.00        1.00         30

Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
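
The tuned pipeline (scaler plus the best SVM) is also available as grid_search.best_estimator_. If you want to reuse it later without re-running the search, you can persist it with joblib, which ships as a scikit-learn dependency; the file name below is just an example:

import joblib

# Save the best pipeline to disk and load it back for later predictions
best_pipeline = grid_search.best_estimator_
joblib.dump(best_pipeline, 'iris_svc_pipeline.joblib')
reloaded = joblib.load('iris_svc_pipeline.joblib')
print(reloaded.predict(X_test[:5]))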
Step 8: Fine-Tuning and Improvements

If the model’s performance is not satisfactory, consider experimenting with different algorithms, richer pipelines (for example, adding feature-selection or dimensionality-reduction steps), or additional features.
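
For instance, the sketch below keeps the same pipeline-plus-grid-search workflow but swaps the SVM for a random forest; the parameter values are illustrative rather than tuned recommendations:

from sklearn.ensemble import RandomForestClassifier

# Same pipeline + grid-search pattern with a different estimator
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),  # not needed for trees, but keeps the structure uniform
    ('rf', RandomForestClassifier(random_state=42))
])
rf_param_grid = {
    'rf__n_estimators': [100, 200],
    'rf__max_depth': [None, 3, 5]
}
rf_search = GridSearchCV(rf_pipeline, rf_param_grid, cv=5, scoring='accuracy')
rf_search.fit(X_train, y_train)
print("Random forest test accuracy:", rf_search.score(X_test, y_test))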

Conclusion

We have walked through the process of optimizing a machine-learning pipeline using scikit-learn, from loading the data to evaluating the tuned model with comprehensive metrics. This approach lets you fine-tune your model efficiently and systematically rather than by trial and error.

Happy coding!