How to Split Data into Training and Testing Sets in Python

In this article, we’ll explore how to split data using train_test_split and discuss its key parameters. Splitting is an important step when building machine learning models: the model learns from one part of the data and is evaluated on another part it has not seen.

What is train_test_split and how to use it?

The train_test_split function, which is part of the scikit-learn (sklearn) library, splits a dataset into training and testing sets. Because the model is tested on data it did not see during training, its evaluation is more reliable. The split does not prevent overfitting by itself, but it helps you detect it: a model that scores well on the training set and poorly on the testing set is not generalizing to new data.

Using train_test_split to split data

In the code below, 80% of the data is used for training and 20% for testing, and random_state makes the split reproducible.

from sklearn.model_selection import train_test_split
import numpy as np

# Toy dataset: 5 samples with 2 features each, and one label per sample
X = np.arange(10).reshape((5, 2))
y = np.array([0, 1, 0, 1, 0])

# Split the data: 80% training, 20% testing (here 4 samples and 1 sample)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train:\n", X_train)
print("X_test:\n", X_test)
print("y_train:\n", y_train)
print("y_test:\n", y_test)

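NOTE: test_size=0.2 is a common starting point, not a fixed rule. The right split depends on how much data you have: very small datasets may need a larger test share to get a meaningful evaluation, while very large ones can often spare a smaller one.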
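The test_size parameter accepts either a fraction of the dataset or an absolute number of samples. The snippet below is a small sketch of both forms; the 20-sample dataset and the variable names frac_sizes and count_sizes are made up for illustration.

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Hypothetical toy dataset: 20 samples with 2 features each
X = np.arange(40).reshape((20, 2))
y = np.arange(20) % 2

# test_size as a float: the fraction of samples placed in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
frac_sizes = (len(X_train), len(X_test))

# test_size as an int: the absolute number of test samples
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=3, random_state=0
)
count_sizes = (len(X_train2), len(X_test2))

print("Fractional split (train, test):", frac_sizes)
print("Count-based split (train, test):", count_sizes)
```

With 20 samples, test_size=0.25 puts 5 samples in the test set and 15 in the training set, while test_size=3 puts exactly 3 samples in the test set.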

Key Parameters in train_test_split

test_size = 0.2: It specifies that 20% of the data is used for testing and the remaining 80% for training.
random_state = 42: It fixes the seed of the random shuffle, so the same split is produced on every run.
shuffle = True: It shuffles the data before splitting. This is the default behaviour.
stratify = y: It keeps the proportion of class labels the same in the training and testing sets. It is mainly useful when the dataset is imbalanced, and it requires shuffle = True.
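To see what stratify and shuffle do in practice, here is a minimal sketch on a made-up imbalanced dataset (80 samples of class 0 followed by 20 samples of class 1). The dataset and the variable names are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Hypothetical imbalanced dataset: 80 samples of class 0, then 20 of class 1
X = np.arange(200).reshape((100, 2))
y = np.array([0] * 80 + [1] * 20)

# Stratified split: both sets keep the original 80/20 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
train_counts = np.bincount(y_train, minlength=2)
test_counts = np.bincount(y_test, minlength=2)
print("Stratified train class counts:", train_counts)
print("Stratified test class counts:", test_counts)

# Without shuffling, the test set is simply the last 20% of the rows,
# which in this ordered dataset contains only class 1
X_tr_ns, X_te_ns, y_tr_ns, y_te_ns = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
unshuffled_test_counts = np.bincount(y_te_ns, minlength=2)
print("Unshuffled test class counts:", unshuffled_test_counts)
```

The stratified split keeps 64 and 16 samples of each class in the training set and 16 and 4 in the testing set. The unshuffled split leaves the training set with no class 1 samples at all, which is why ordered data should be shuffled or stratified before splitting.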

Summary

Splitting data into training and testing sets is a crucial step in machine learning, and the train_test_split function is a convenient way to do it. Its parameters, such as test_size, random_state, shuffle and stratify, let you control the size, reproducibility and class balance of the split. Set them according to your dataset, then move on to training the model.
