Feature Selection Using Scikit-Learn

Feature selection is a crucial step in building machine learning models: it helps improve a model's performance and accuracy by removing irrelevant or redundant features. In this tutorial, we will explore different methods of feature selection using scikit-learn, a popular machine learning library for Python.

We will use the Iris dataset to demonstrate the various feature selection techniques. By the end of this tutorial, you will know how to apply several feature selection methods to optimize your machine learning models.

Step 1: Setting the stage

Let’s begin by importing the necessary libraries and loading the Iris dataset into a DataFrame.

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Load the dataset
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['species'])
iris_df['species'] = iris_df['species'].astype(int)
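
Optionally, you can take a quick look at the first few rows and the class distribution to make sure the DataFrame was built as expected:

# Optional: quick sanity check of the loaded data
print(iris_df.head())
print(iris_df['species'].value_counts())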

Step 2: Preparing the data

We will split the dataset into training and testing sets.

# Features and target variable
X = iris_df.drop(columns=['species'])
y = iris_df['species']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

Output:

Training set size: 120
Test set size: 30
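
The Iris classes are perfectly balanced, so this plain random split works fine here. On less balanced data you might prefer a stratified split, which is a one-argument change; shown below as an optional variant that is not used in the rest of the tutorial:

# Optional variant: preserve class proportions in both splits
# (skip it if you want to reproduce the outputs below exactly)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)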

Step 3: Feature selection methods

We will explore three common feature selection methods:

  1. Univariate Selection
  2. Recursive Feature Elimination (RFE)
  3. Feature Importance

3.1 Univariate Selection

Univariate selection uses statistical tests to pick the features with the strongest relationship to the target variable. We will use the SelectKBest class with the chi-squared (chi2) statistical test; note that chi2 requires non-negative feature values, which the Iris measurements satisfy.

from sklearn.feature_selection import SelectKBest, chi2

# Apply SelectKBest with chi-squared test
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print("Selected features using chi-squared test:")
print(X.columns[selector.get_support()])

Output:

Selected features using chi-squared test:
Index(['petal length (cm)', 'petal width (cm)'], dtype='object')
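
If you are curious how each feature scored, the fitted selector exposes the chi-squared statistic for every feature through its scores_ attribute; a quick way to inspect them:

# Chi-squared score of every feature (higher = stronger relationship with the target)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)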

3.2 Recursive Feature Elimination (RFE)

RFE repeatedly fits a model and removes the least important feature at each iteration until only the desired number of features remains. We will use a linear SVM classifier as the estimator for RFE.

from sklearn.feature_selection import RFE

# Create an SVM classifier
svc = SVC(kernel="linear")

# Apply RFE
rfe = RFE(estimator=svc, n_features_to_select=2)
rfe.fit(X, y)

print("Selected features using RFE:")
print(X.columns[rfe.support_])

Output:

Selected features using RFE:
Index(['petal length (cm)', 'petal width (cm)'], dtype='object')
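
Besides the boolean mask in support_, the fitted RFE object also stores a ranking_ array: selected features get rank 1, and higher numbers mean a feature was eliminated earlier. For example:

# Rank of each feature (1 = selected, larger numbers were eliminated earlier)
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking)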

3.3 Feature Importance

Feature importance assigns a score to each feature based on how much it contributes to the model's predictions. We will use a Random Forest classifier to compute these importance scores.

from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier
rf = RandomForestClassifier()

# Fit the model
rf.fit(X, y)

# Get feature importances
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

print("Feature ranking:")
for i in range(X.shape[1]):
    print(f"{i + 1}. {X.columns[indices[i]]} ({importances[indices[i]]})")

Output:

Feature ranking:
1. petal length (cm) (0.43114039589983164)
2. petal width (cm) (0.4120028692350589)
3. sepal length (cm) (0.1114641635421478)
4. sepal width (cm) (0.045392571322961704)
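
If you would rather let scikit-learn pick the features automatically instead of sorting the importances yourself, SelectFromModel can select them based on an importance threshold. A minimal sketch, where the "mean" threshold is just an example choice:

from sklearn.feature_selection import SelectFromModel

# Keep only features whose importance exceeds the mean importance
sfm = SelectFromModel(RandomForestClassifier(), threshold="mean")
sfm.fit(X, y)
print("Selected features using SelectFromModel:")
print(X.columns[sfm.get_support()])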

Step 4: Evaluating the Model with Selected Features

We will use the selected features from each method to build and evaluate an SVM model.
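
One detail worth noting: for simplicity, the selectors in Step 3 were fitted on the whole dataset. If you want the test set to stay completely unseen during selection, you can perform the selection inside the pipeline so it is fitted on the training split only; a minimal sketch of that variant:

# Variant: fit the selector on the training split only, as part of the pipeline
pipeline_with_selection = Pipeline([
    ('select', SelectKBest(score_func=chi2, k=2)),  # chi2 needs non-negative inputs, so it runs before scaling
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear'))
])
pipeline_with_selection.fit(X_train, y_train)
print("Accuracy with in-pipeline selection:", pipeline_with_selection.score(X_test, y_test))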

4.1 Using Univariate Selection

# Use selected features from univariate selection
X_train_new = X_train.iloc[:, selector.get_support()]
X_test_new = X_test.iloc[:, selector.get_support()]

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear'))
])

# Fit and evaluate the model
pipeline.fit(X_train_new, y_train)
y_pred = pipeline.predict(X_test_new)
print("\nEvaluation using Univariate Selection:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))

Output:

Evaluation using Univariate Selection:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

[[10  0  0]
 [ 0  8  1]
 [ 0  0 11]]

4.2 Using RFE

# Use selected features from RFE
X_train_new = X_train.iloc[:, rfe.support_]
X_test_new = X_test.iloc[:, rfe.support_]

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear'))
])

# Fit and evaluate the model
pipeline.fit(X_train_new, y_train)
y_pred = pipeline.predict(X_test_new)
print("\nEvaluation using RFE:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))

Output:

Evaluation using RFE:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

[[10  0  0]
 [ 0  8  1]
 [ 0  0 11]]

4.3 Using Feature Importance

# Select top 2 features based on feature importance
top_features = indices[:2]
X_train_new = X_train.iloc[:, top_features]
X_test_new = X_test.iloc[:, top_features]

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear'))
])

# Fit and evaluate the model
pipeline.fit(X_train_new, y_train)
y_pred = pipeline.predict(X_test_new)
print("\nEvaluation using Feature Importance:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))

Output:

Evaluation using Feature Importance:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

[[10  0  0]
 [ 0  8  1]
 [ 0  0 11]]

Conclusion

We have explored three different methods of feature selection using scikit-learn: Univariate Selection, Recursive Feature Elimination (RFE), and Feature Importance. Each method has its own strengths, and the best choice depends on the specific problem and dataset.

Feature selection is a powerful technique to improve the performance of your machine learning models by reducing overfitting and making the model simpler and more interpretable.

Happy coding!