Hello! Feature selection is a crucial step in building machine learning models: removing irrelevant or redundant features helps improve model performance and accuracy. In this tutorial, we will explore different feature selection methods using scikit-learn, a popular machine learning library in Python.
We will use the classic Iris dataset to demonstrate the various techniques. By the end of this tutorial, you will know how to apply several feature selection methods to optimize your machine learning models.
Feature Selection Using Scikit-Learn
Step 1: Setting the stage
Let’s begin by importing the necessary libraries and loading the Iris dataset into a DataFrame.
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Load the dataset
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['species'])
iris_df['species'] = iris_df['species'].astype(int)
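As a quick, optional sanity check, you can preview the DataFrame and the class distribution before moving on; this is just a small sketch and not part of the selection workflow itself:

# Preview the first rows and confirm the three classes are present
print(iris_df.head())
print(iris_df['species'].value_counts())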
Step 2: Preparing the data
We will split the dataset into training and testing sets.
# Features and target variable
X = iris_df.drop(columns=['species'])
y = iris_df['species']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
Output:
Training set size: 120
Test set size: 30
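If you want the class proportions preserved in both splits, train_test_split also accepts a stratify argument. This is an optional variation on the split above (the _s-suffixed names are just placeholders so the original split stays untouched):

# Optional: stratify the split so each class appears in the same
# proportion in the training and test sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)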
Step 3: Feature selection methods
We will explore three common feature selection methods:
- Univariate Selection
- Recursive Feature Elimination (RFE)
- Feature Importance
3.1 Univariate Selection
Univariate selection uses statistical tests to select the features that have the strongest relationship with the target variable. We will use the SelectKBest class with the chi-squared (chi2) statistical test. Note that chi2 requires non-negative feature values, which is the case for the Iris measurements.
from sklearn.feature_selection import SelectKBest, chi2

# Apply SelectKBest with the chi-squared test
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print("Selected features using chi-squared test:")
print(X.columns[selector.get_support()])
Output:
Selected features using chi-squared test:
Index(['petal length (cm)', 'petal width (cm)'], dtype='object')
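If you want to see how each feature scored rather than only which ones were kept, the fitted selector exposes the per-feature chi-squared statistics through its scores_ attribute; a short sketch:

# Inspect the chi-squared score of every feature, highest first
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)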
3.2 Recursive Feature Elimination (RFE)
RFE recursively removes the least important features and builds a model using the remaining features. We will use an SVM classifier with RFE.
from sklearn.feature_selection import RFE

# Create an SVM classifier
svc = SVC(kernel="linear")

# Apply RFE
rfe = RFE(estimator=svc, n_features_to_select=2)
rfe.fit(X, y)

print("Selected features using RFE:")
print(X.columns[rfe.support_])
Output:
Selected features using RFE:
Index(['petal length (cm)', 'petal width (cm)'], dtype='object')
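Besides the boolean support_ mask, the fitted RFE object also stores a ranking_ array, where selected features are ranked 1 and higher numbers indicate earlier elimination; a quick way to look at it:

# Show the RFE ranking for every feature (1 = selected)
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking)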
3.3 Feature Importance
Feature importance assigns a score to each feature based on its contribution to the prediction. We will use a Random Forest classifier to determine feature importance.
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier
rf = RandomForestClassifier()

# Fit the model
rf.fit(X, y)

# Get feature importances and rank features from most to least important
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

print("Feature ranking:")
for i in range(X.shape[1]):
    print(f"{i + 1}. {X.columns[indices[i]]} ({importances[indices[i]]})")
Output:
Feature ranking:
1. petal length (cm) (0.43114039589983164)
2. petal width (cm) (0.4120028692350589)
3. sepal length (cm) (0.1114641635421478)
4. sepal width (cm) (0.045392571322961704)
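If you would rather let scikit-learn pick the features from these importances automatically instead of slicing the ranking by hand, SelectFromModel can do it for you. Here is a minimal sketch using the mean importance as the cut-off; the threshold choice is an assumption, and because the selector refits a fresh (random) forest, the selected set can occasionally differ from the ranking above:

from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance is above the mean importance
sfm = SelectFromModel(RandomForestClassifier(), threshold='mean')
sfm.fit(X, y)
print("Selected features using SelectFromModel:")
print(X.columns[sfm.get_support()])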
Step 4: Evaluating the Model with Selected Features
We will use the selected features from each method to build and evaluate an SVM model.
4.1 Using Univariate Selection
# Use selected features from univariate selection
X_train_new, X_test_new = X_train.iloc[:, selector.get_support()], X_test.iloc[:, selector.get_support()]

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear'))
])

# Fit and evaluate the model
pipeline.fit(X_train_new, y_train)
y_pred = pipeline.predict(X_test_new)

print("\nEvaluation using Univariate Selection:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))
Output:
Evaluation using Univariate Selection:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

[[10  0  0]
 [ 0  8  1]
 [ 0  0 11]]
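Note that in Step 3 the selectors were fit on the full dataset for simplicity. If you prefer to fit the selection step on the training data only, the selector can be placed inside the pipeline itself, so it is applied automatically to the test split. A sketch of that variation (chi2 comes before scaling because it needs non-negative inputs; the pipeline name is just a placeholder):

# Put the feature selector inside the pipeline so it is fit on the
# training split only and reused on the test split
leak_free_pipeline = Pipeline([
    ('select', SelectKBest(score_func=chi2, k=2)),
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear'))
])
leak_free_pipeline.fit(X_train, y_train)
print(leak_free_pipeline.score(X_test, y_test))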
4.2 Using RFE
# Use selected features from RFE
X_train_new, X_test_new = X_train.iloc[:, rfe.support_], X_test.iloc[:, rfe.support_]

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear'))
])

# Fit and evaluate the model
pipeline.fit(X_train_new, y_train)
y_pred = pipeline.predict(X_test_new)

print("\nEvaluation using RFE:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))
Output:
Evaluation using RFE:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

[[10  0  0]
 [ 0  8  1]
 [ 0  0 11]]
4.3 Using Feature Importance
# Select top 2 features based on feature importance
top_features = indices[:2]
X_train_new, X_test_new = X_train.iloc[:, top_features], X_test.iloc[:, top_features]

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear'))
])

# Fit and evaluate the model
pipeline.fit(X_train_new, y_train)
y_pred = pipeline.predict(X_test_new)

print("\nEvaluation using Feature Importance:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))
Output:
Evaluation using Feature Importance:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

[[10  0  0]
 [ 0  8  1]
 [ 0  0 11]]
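A single 80/20 split on 150 samples gives a fairly noisy estimate, so you may also want to cross-validate whichever pipeline you settle on. A minimal sketch using 5-fold cross-validation on the two top-importance features (the choice of cv=5 is just a common default):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the scaler + SVM pipeline on the two
# most important features
cv_scores = cross_val_score(pipeline, X.iloc[:, top_features], y, cv=5)
print(f"Cross-validated accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")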
Conclusion
We have explored three different feature selection methods in scikit-learn: Univariate Selection, Recursive Feature Elimination (RFE), and Feature Importance. Each method has its own strengths, and the best choice depends on the specific problem and dataset.
Feature selection is a powerful way to improve your machine learning models: it can reduce overfitting and make the model simpler and more interpretable.
Happy coding!