In this article, we'll explore how to split data using train_test_split and discuss its key parameters. Splitting is an important step when building machine learning models: the model learns from one part of the data and is evaluated on another part it has not seen.
What is train_test_split and how to use it?
The train_test_split function, part of the scikit-learn (sklearn) library, splits a dataset into training and testing sets. This ensures that the model is tested on unseen data, which makes its evaluation more reliable. Splitting does not prevent overfitting by itself, but it helps detect it: a model that scores well on the training set and poorly on the test set is failing to generalize to new data.
Using train_test_split to split data
In the code below, 80% of the data is used for training and 20% for testing (4 samples and 1 sample, respectively), and random_state makes the split reproducible.
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(10).reshape((5, 2))
y = np.array([0, 1, 0, 1, 0])

# Splitting data: 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train:\n", X_train)
print("X_test:\n", X_test)
print("y_test:\n", y_test)
print("y_train:\n", y_train)
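NOTE: test_size=0.2 is a common choice, not a fixed rule. The right ratio depends on how much data is available. If both test_size and train_size are omitted, scikit-learn uses test_size=0.25.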
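To make the effect of test_size concrete, here is a minimal sketch that compares a fractional value, an integer value, and the default. The 40-sample toy dataset and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 samples with 2 features each
X = np.arange(80).reshape((40, 2))
y = np.arange(40) % 2

# Float test_size: fraction of the samples held out for testing
_, X_test_frac, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)

# Integer test_size: absolute number of test samples
_, X_test_abs, _, _ = train_test_split(X, y, test_size=15, random_state=0)

# Neither test_size nor train_size given: falls back to test_size=0.25
_, X_test_default, _, _ = train_test_split(X, y, random_state=0)

print("test_size=0.2  ->", len(X_test_frac), "test samples")     # 8
print("test_size=15   ->", len(X_test_abs), "test samples")      # 15
print("default        ->", len(X_test_default), "test samples")  # 10
```

A fractional value scales with the dataset, while an integer fixes the test set size regardless of how much data there is.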
Key Parameters in train_test_split
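- test_size = 0.2: Specifies that 20% of the data is used for testing and the remaining 80% for training. An integer value is interpreted as the absolute number of test samples.
- random_state = 42: Fixes the seed used for shuffling, so the same split is produced on every run.
- shuffle = True: Shuffles the data before splitting. This is the default.
- stratify = y: Preserves the proportion of class labels in both splits, which matters most when the dataset is imbalanced. It requires shuffle = True.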
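The effect of stratify and shuffle is easiest to see on an imbalanced dataset. The sketch below uses a hypothetical dataset of 90 samples of class 0 followed by 10 samples of class 1; the data and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data, sorted by class: 90 zeros, then 10 ones
X = np.arange(200).reshape((100, 2))
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90/10 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("stratified train counts:", train_counts)  # [72  8]
print("stratified test counts: ", test_counts)   # [18  2]

# shuffle=False keeps the original order: the last 20% of rows form the test set
X_train_ord, X_test_ord, y_train_ord, y_test_ord = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print("unshuffled train labels:", np.unique(y_train_ord))  # [0]
print("unshuffled test labels: ", np.unique(y_test_ord))   # [0 1]
```

With shuffle=False on data sorted by class, the training set here contains no samples of class 1 at all. Ordered splits of this kind are usually reserved for data where the order itself matters, such as time series.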
Summary
Splitting data into training and testing sets is a crucial step in machine learning. The train_test_split function offers an efficient way to do it, and its parameters (test_size, random_state, shuffle, and stratify) make it flexible. Set them according to your needs, then move on to training the model.