Introduction: Data splitting is a critical step in building machine learning models, ensuring their accuracy and generalization. In Python, Sklearn provides powerful tools for this task, notably the Train-Test Split method. In this guide, we’ll delve into the process of splitting data using Sklearn, implementing it with a student dataset to solidify understanding.
Understanding Train-Test Split: The Train-Test Split method partitions a dataset into two subsets: one for training the model and the other for testing its performance. Evaluating on data the model has never seen reveals whether it has merely memorized the training data (overfitting) and how well it generalizes to new, unseen data.
Implementation Steps:
- Importing Necessary Libraries: Import Sklearn and other required libraries like Pandas for dataset handling.
- Loading the Dataset: Load the student dataset into a Pandas DataFrame. This dataset typically contains features like student scores, attendance, etc., and the target variable, such as student grades.
- Data Preprocessing: Preprocess the dataset as needed, handling missing values, encoding categorical variables, and scaling numerical features (a minimal sketch follows this list).
- Splitting the Data: Use Sklearn’s train_test_split method to split the dataset into training and testing sets. Specify the test size (e.g., 20% of the data) and optionally set a random state for reproducibility.
- Building and Training the Model: Choose a machine learning algorithm suitable for the task (e.g., regression for predicting grades) and instantiate the model. Train the model using the training data.
- Evaluating Model Performance: Use the testing data to evaluate the model’s performance. Calculate metrics like accuracy, precision, recall, or mean squared error, depending on the problem type (a minimal sketch of this step and the previous one appears after the split walkthrough below).
- Iteration and Optimization: Fine-tune the model parameters, try different algorithms, or explore feature engineering techniques to improve performance.
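Step 3 (Data Preprocessing) is not shown in the walkthrough below, so here is a minimal, hedged sketch of one way to handle a missing value. The tiny inline DataFrame mirrors the first rows of the student dataset used later in this guide; the column names and the mean-imputation strategy are illustrative assumptions, not a prescribed workflow.
# Minimal preprocessing sketch (illustrative only, not the exact workflow used below).
# The inline data mirrors the first rows of the student dataset; mean imputation is an assumption.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "study_hours": [6.83, 6.56, None, 5.67, 8.67],
    "student_marks": [78.50, 76.74, 78.68, 71.82, 84.19],
})

# Fill the missing study_hours value with the column mean
imputer = SimpleImputer(strategy="mean")
df[["study_hours"]] = imputer.fit_transform(df[["study_hours"]])
print(df)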
Let’s see how to implement this in Python.
import pandas as pd
s = pd.read_csv(r"/content/student_info1.csv")
We read the CSV file using Pandas; evaluating s displays the DataFrame.
s
Output:
     study_hours  student_marks
0           6.83          78.50
1           6.56          76.74
2            NaN          78.68
3           5.67          71.82
4           8.67          84.19
..           ...            ...
195         7.53          81.67
196         8.56          84.68
197         8.94          86.75
198         6.60          78.05
199         8.35          83.50
200 rows × 2 columns
x=s.drop(["student_marks"],axis="columns")
y=s["student_marks"]
In the code above, we drop the student_marks column to form x and keep student_marks as y. In other words, x (study_hours) is the independent variable and y (student_marks) is the dependent variable, because student marks depend on study hours.
x
So we can print x to see the output
Output:
     study_hours
0           6.83
1           6.56
2            NaN
3           5.67
4           8.67
..           ...
195         7.53
196         8.56
197         8.94
198         6.60
199         8.35
200 rows × 1 columns
y
Output:
0      78.50
1      76.74
2      78.68
3      71.82
4      84.19
       ...
195    81.67
196    84.68
197    86.75
198    78.05
199    83.50
Name: student_marks, Length: 200, dtype: float64
Now we have separated study_hours and student_marks into the x and y variables.
Next, we split these variables into a train set and a test set using Sklearn.
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=2020)
Now let’s explain the above code step by step:
- from sklearn.model_selection import train_test_split: This line imports a function called train_test_split from the model_selection module of the sklearn library. sklearn (short for scikit-learn) is a popular library in Python used for machine learning tasks. The train_test_split function helps in splitting a dataset into two parts: one for training a machine learning model and the other for testing its performance.
- x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2020): This line actually splits the dataset into four parts:
- x is typically the input features (or independent variables) of the dataset.
- y is usually the target variable (or dependent variable) we want to predict.
- test_size=0.2 specifies that 20% of the dataset will be used for testing the model, and the remaining 80% will be used for training. You can adjust this value to allocate different proportions for training and testing.
- random_state=2020 is a parameter that ensures reproducibility of the split. Setting it to a specific number (in this case, 2020) ensures that every time you run the code, the data will be split in the same way, which can be useful for debugging and sharing results.
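To make test_size and random_state concrete, here is a small illustrative check. It assumes the x, y, and train_test_split import defined above (200 rows in total); the extra variable names are just for this demonstration.
# Illustrative check only (assumes x, y and the train_test_split import from above).
# 1) test_size controls the proportions: 0.2 gives 160/40 rows, 0.3 would give 140/60.
x_train_30, x_test_30, y_train_30, y_test_30 = train_test_split(
    x, y, test_size=0.3, random_state=2020
)
print(len(x_train_30), len(x_test_30))  # 140 60

# 2) Reusing the same random_state reproduces the same split on every run.
x_train_a, _, _, _ = train_test_split(x, y, test_size=0.2, random_state=2020)
x_train_b, _, _, _ = train_test_split(x, y, test_size=0.2, random_state=2020)
print(x_train_a.index.equals(x_train_b.index))  # True: identical rows selected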
x_train
The line above prints the x_train data.
Output:
(the study_hours column for the 160 rows selected for training — the same row indices as y_train below; 160 rows × 1 columns)
y_train
Output:
76     72.08
34     83.08
1      76.74
40     70.27
151    76.70
       ...
195    81.67
118    73.61
67     81.70
136    83.15
96     75.39
Name: student_marks, Length: 160, dtype: float64
So we have printed the train data; now let’s look at the test set data.
x_test
Output:
(the study_hours column for the 40 rows selected for testing — the same row indices as y_test below; 40 rows × 1 columns)
y_test
Output:
199    83.50
119    75.55
186    85.10
27     75.65
57     86.65
145    85.15
161    79.49
169    83.08
23     75.02
191    70.51
130    73.19
28     74.15
42     71.10
177    73.64
154    78.45
164    82.04
108    74.25
115    74.44
87     81.74
89     84.60
95     76.48
92     72.08
70     71.80
35     76.76
143    75.52
41     86.41
104    77.55
175    71.11
5      81.18
163    77.07
181    71.83
2      78.68
117    85.04
68     69.27
140    84.58
101    82.03
11     83.88
88     71.85
120    76.20
196    84.68
Name: student_marks, dtype: float64
Now let’s print the shapes of the train set and the test set.
x_train.shape
Output:
(160, 1)
y_train.shape
Output:
(160,)
x_test.shape
Output:
(40, 1)
y_test.shape
Output:
(40,)
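Steps 5 and 6 from the list at the start (building/training a model and evaluating it) are not shown above, so here is a minimal, hedged sketch of how they might look with these variables. LinearRegression and the mean-imputation of the missing study_hours value are assumptions chosen for illustration, not a required model; the printed metric values depend on the actual data.
# Illustrative sketch of steps 5-6 (model choice and imputation are assumptions).
# Assumes x_train, x_test, y_train, y_test from the split above.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# study_hours contains a missing value (row 2), so fill it with the training mean first
x_train_filled = x_train.fillna(x_train.mean())
x_test_filled = x_test.fillna(x_train.mean())

model = LinearRegression()
model.fit(x_train_filled, y_train)        # train on the 160 training rows

y_pred = model.predict(x_test_filled)     # predict marks for the 40 test rows
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", model.score(x_test_filled, y_test))
Because there is only one feature (study_hours), simple linear regression is a natural first model here, but any regressor with the same fit/predict interface would slot in the same way.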
Conclusion: By following these steps, you’ve learned how to apply the Train-Test Split technique using Sklearn in Python. Understanding data splitting is crucial for building robust and reliable machine learning models. Experiment with different datasets and model configurations to deepen your understanding and enhance your machine learning skills.