Hyperparameter Tuning in Scikit-Learn Using GridSearchCV & RandomizedSearchCV
Machine learning models are powerful tools, but how well they perform depends heavily on the hyperparameters you choose. Hyperparameter tuning is one of the most important steps in building a model: it means searching for the hyperparameter settings that give the best results. In this guide we will examine GridSearchCV and RandomizedSearchCV, two popular hyperparameter tuning techniques offered by scikit-learn.
GridSearchCV lets you exhaustively search through a grid of hyperparameters to find the best combination for your model, while RandomizedSearchCV randomly samples from a distribution of possible combinations. Both methods help save time by automating the process of parameter tuning and finding the optimal values for your model. It’s like having a personal assistant helping you navigate through the vast sea of hyperparameters to make sure your model is performing at its best. So next time you’re feeling overwhelmed by all those parameters, just remember GridSearchCV and RandomizedSearchCV are there to lend a helping hand.
What are Hyperparameters?
Hyperparameters in machine learning are basically like the knobs and switches you can tweak on a model to try and maximize its performance. Think of them as the recipe ingredients you can adjust to get that perfect chocolate chip cookie. These parameters aren’t learned during training, unlike the model’s weights, but they impact how well the model learns those weights.
Things like learning rates, batch sizes, and regularization strengths are all examples of hyperparameters. It’s kind of like finding that sweet spot where your model isn’t underfitting or overfitting – just right for making accurate predictions. So, tinkering with these hyperparameters is key to optimizing your machine learning model and getting it to work like a charm. Selecting appropriate hyperparameters can:
- Improve model accuracy and performance.
- Prevent overfitting by controlling model complexity.
- Enhance model generalization to new, unseen data.
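For a concrete picture of where hyperparameters live in scikit-learn, here is a minimal sketch, separate from the workflow below: hyperparameters are the arguments you pass to an estimator's constructor, while the model's internal parameters are learned when you call fit().

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hyperparameters (C, kernel, gamma) are chosen by you, before training
model = SVC(C=1.0, kernel='rbf', gamma=0.1)

# The model's internal parameters (support vectors, coefficients) are learned here
model.fit(X, y)

# get_params() lists the hyperparameters of any scikit-learn estimator
print(model.get_params())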
GridSearchCV
GridSearchCV is a powerful tool in Scikit-learn’s model_selection package designed to streamline the process of hyperparameter tuning. By looping through predefined hyperparameters, GridSearchCV fits the selected estimator (model) on your training set multiple times to identify the optimal parameter combination. This systematic approach ensures that you test various configurations efficiently, saving time and effort compared to manual tuning. As a result, GridSearchCV helps you achieve the best possible performance for your machine learning model without the need for tedious trial and error.
Think of it as a personal assistant for your model’s hyperparameter tuning. Similar to trying out multiple outfit combinations before picking the perfect one for a big event, GridSearchCV automatically explores different parameter settings to find the optimal match. This method eliminates the guesswork involved in model fine-tuning and ensures you can focus more on interpreting results rather than tweaking settings manually. By simplifying the tuning process, GridSearchCV becomes an invaluable tool for achieving top-notch model performance with minimal hassle.
Step-by-Step Guide to Using GridSearchCV
Step 1: Importing Dataset
To demonstrate hyperparameter tuning with GridSearchCV, we will use the Iris dataset, which can be downloaded from Kaggle (https://www.kaggle.com/datasets/uciml/iris). The same dataset is also bundled with scikit-learn, which is how the code in the later steps loads it.
import kagglehub

# Download latest version
path = kagglehub.dataset_download("uciml/iris")
print("Path to dataset files:", path)
Step 2: Importing the Required Libraries
We will be importing the following libraries:
- numpy: Used for numerical operations.
- load_iris(): Loads the Iris dataset, a small, popular dataset for classification tasks.
- train_test_split(): Splits the dataset into training and testing sets.
- GridSearchCV: The class that will help us find the best hyperparameters.
- SVC: The Support Vector Classifier model from Scikit-learn.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
Step 3: Loading the Dataset and Splitting It
The Iris dataset includes three iris species with 50 samples each, along with several measurements for each flower. One species is linearly separable from the other two, but the other two are not linearly separable from each other. The .data attribute contains the features, and .target holds the labels. The train_test_split() function splits the dataset:
- test_size=0.3: Reserves 30% of the data for testing.
- random_state=42: Ensures reproducibility by fixing the random split.
iris = load_iris()   # Load the Iris dataset
X = iris.data        # Features (sepal/petal lengths and widths)
y = iris.target      # Target labels (species of flowers)

# Split data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Step 4: Defining the Model and Hyperparameter Grid
Here C, gamma, and kernel are some of the hyperparameters of an SVM model. GridSearchCV tries every combination of the values passed in the dictionary and evaluates the model for each combination using cross-validation. After fitting, we therefore have a score for every combination of hyperparameters and can choose the one with the best performance.
The param_grid defines the hyperparameters to test:
- C: Controls the trade-off between training error and model simplicity (larger values apply less regularization, which can reduce underfitting but increases the risk of overfitting).
- kernel: Chooses the type of decision boundary.
- gamma: Controls how far the influence of a single training example reaches in the RBF kernel (small values mean a far reach, large values mean a close one).
GridSearchCV will test every possible combination of these parameters (3 × 2 × 3 = 18 combinations).
model = SVC()

param_grid = {
    'C': [0.1, 1, 10],            # Regularization strength
    'kernel': ['linear', 'rbf'],  # Linear and RBF kernel options
    'gamma': [0.01, 0.1, 1]       # Kernel coefficient for the RBF kernel
}
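If you want to confirm how many configurations the grid contains before running the search, scikit-learn's ParameterGrid can enumerate it for you. This is a small optional check, not required for the tuning itself:

from sklearn.model_selection import ParameterGrid

# Each element of ParameterGrid is one dictionary of hyperparameter settings
print(len(ParameterGrid(param_grid)))  # 18 combinations = 3 (C) * 2 (kernel) * 3 (gamma)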
Step 5: Running GridSearchCV
GridSearchCV() accepts the following key arguments:
- model: The model to optimize (here, SVC).
- param_grid: The dictionary of hyperparameters to test.
- cv=5: Uses 5-fold cross-validation, meaning the training data is split into 5 parts and the model is trained 5 times, each time using 4 parts for training and 1 part for validation. This gives a more reliable estimate of each configuration's performance than a single train/validation split.
- scoring='accuracy': Tells GridSearchCV to evaluate each combination based on accuracy.
.fit() runs the search: it trains and cross-validates the model for every combination of hyperparameters, then refits the best one on the full training set.
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
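Because GridSearchCV records a cross-validated score for every combination it tries, you can also look at the full results rather than just the winner. A minimal sketch using the fitted object's cv_results_ attribute (this assumes pandas is installed and is only for inspection):

import pandas as pd

# cv_results_ holds one row per hyperparameter combination
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score'))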
Step 6: Displaying Results
- .best_params_ identifies the optimal hyperparameter combination.
- .best_score_ shows the highest accuracy achieved during cross-validation.
- .best_estimator_ retrieves the best-trained model, which is then evaluated on the test set using .score().
print("Best Parameters:", grid_search.best_params_) print("Best Score:", grid_search.best_score_) test_score = grid_search.best_estimator_.score(X_test, y_test) print("Test Score:", test_score)
Output:
Here the output indicates that:
- The best combination of hyperparameters (C = 1, gamma = 0.01, and kernel = 'linear') gave the highest accuracy during training.
- The Best Score (0.96) is the average cross-validation accuracy.
- The Test Score (0.9778) is the accuracy of the tuned model on unseen test data.
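Since GridSearchCV refits the best estimator on the whole training set by default (refit=True), the fitted search object can also be used directly for prediction. A short optional sketch using scikit-learn's classification_report for a per-class breakdown:

from sklearn.metrics import classification_report

# grid_search.predict() delegates to the refit best_estimator_
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))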
RandomizedSearchCV
When tuning machine learning models, finding the best hyperparameters can be challenging. Testing every possible combination using GridSearchCV can become slow and inefficient, especially when there are many parameters with a wide range of values. This is where RandomizedSearchCV comes in — a faster and smarter alternative that efficiently searches for the best hyperparameters without trying every single combination.
RandomizedSearchCV is like your personal genie that helps you find the best hyperparameters for your machine learning model. Instead of having to manually try out different combinations and waste endless hours, this nifty tool randomly selects a set of parameters to test out and finds the optimal ones through cross-validation. It’s like playing a game of trial and error but with super speed and efficiency. You can also specify the number of iterations it should go through, so you have control over how thorough you want the search to be. So, if you want to level up your model’s performance without breaking a sweat, just call on RandomizedSearchCV.
Step-by-Step Guide to Using RandomizedSearchCV
Step 1: Importing Dataset
To demonstrate hyperparameter tuning with RandomizedSearchCV, we will again use the Iris dataset from Kaggle (https://www.kaggle.com/datasets/uciml/iris).
import kagglehub

# Download latest version
path = kagglehub.dataset_download("uciml/iris")
print("Path to dataset files:", path)
Step 2: Importing the Required Libraries
We will be importing the following libraries:
- numpy: Used for numerical operations.
- load_iris(): Loads the Iris dataset for demonstration.
- train_test_split(): Splits the dataset into training and testing sets.
- RandomizedSearchCV: The class that performs randomized hyperparameter tuning.
- SVC: The Support Vector Classifier model we are optimizing.
- uniform: Used to create a continuous distribution for randomly sampling hyperparameter values.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.svm import SVC
from scipy.stats import uniform
Step 3: Loading the Dataset and Splitting It
The Iris dataset contains 150 samples, each with 4 features and 3 classes (flower species). The train_test_split() function reserves 30% of the data for testing.
iris = load_iris()   # Load the Iris dataset
X = iris.data        # Features (sepal/petal lengths and widths)
y = iris.target      # Target labels (species of flowers)

# Split data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Step 4: Defining the Model and Hyperparameter Grid
Here, param_dist defines the hyperparameters to sample from:
- C: Regularization strength, sampled from a continuous uniform distribution. Note that scipy's uniform(loc, scale) draws from the interval [loc, loc + scale], so uniform(0.1, 10) produces values between 0.1 and 10.1.
- kernel: Chooses between a linear or RBF kernel.
- gamma: Sampled from uniform(0.01, 1), controlling how much influence each training point has.
Unlike GridSearchCV, which tests all combinations, RandomizedSearchCV randomly picks values within these ranges.
model = SVC()

param_dist = {
    'C': uniform(0.1, 10),        # Continuous values sampled from loc=0.1, scale=10
    'kernel': ['linear', 'rbf'],  # Linear and RBF kernels
    'gamma': uniform(0.01, 1)     # Continuous values sampled from loc=0.01, scale=1
}
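If you are unsure what these distributions produce, you can draw samples from them directly. A quick optional sketch (the exact numbers depend only on the seed; this snippet is not part of the tuning workflow itself):

from scipy.stats import uniform

# uniform(loc, scale) is uniform over [loc, loc + scale]
print(uniform(0.1, 10).rvs(size=5, random_state=42))    # five candidate C values
print(uniform(0.01, 1).rvs(size=5, random_state=42))    # five candidate gamma values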
Step 5: Running RandomizedSearchCV
RandomizedSearchCV() accepts the following key arguments:
- n_iter=10: Limits the search to 10 random combinations (instead of testing all possibilities).
- cv=5: Performs 5-fold cross-validation to ensure reliable evaluation.
- scoring='accuracy': Uses accuracy as the evaluation metric.
- random_state=42: Ensures reproducibility by controlling randomness.
random_search = RandomizedSearchCV(
    model, param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=42
)
random_search.fit(X_train, y_train)
The fewer combinations you specify in n_iter, the faster RandomizedSearchCV runs — but fewer combinations may reduce your chances of finding the absolute best parameters.
Step 6: Displaying Results
- .best_params_: Displays the optimal hyperparameter combination found.
- .best_score_: Shows the highest accuracy achieved during cross-validation.
- .best_estimator_: Retrieves the best-trained model for evaluation on the test set.
print("Best Parameters:", random_search.best_params_) print("Best Score:", random_search.best_score_) test_score = random_search.best_estimator_.score(X_test, y_test) print("Test Score:", test_score)
Output:
- The best combination of hyperparameters was C = 3.437, gamma = 0.153, and kernel = 'linear'.
- The Best Score (0.971) reflects the highest cross-validation accuracy.
- The Test Score (0.9778) shows that the model generalizes well on unseen data.
Conclusion
GridSearchCV and RandomizedSearchCV are both powerful, efficient ways to improve a model's performance by automating hyperparameter tuning. GridSearchCV is exhaustive and evaluates every combination in the grid, while RandomizedSearchCV often reaches comparable results faster by sampling a fixed number of random combinations.
If your model has numerous hyperparameters or large value ranges, RandomizedSearchCV is the smarter, faster choice.