Introduction: Data splitting is a critical step in building machine learning models, ensuring their accuracy and generalization. In Python, Sklearn provides powerful tools for this task, notably the Train-Test Split method. In this guide, we’ll delve into the process of splitting data using Sklearn, implementing it with a student dataset to solidify understanding.
Understanding Train-Test Split: The Train-Test Split method partitions a dataset into two subsets: one for training the model and the other for testing its performance. Evaluating on data the model has never seen reveals whether it has merely memorized the training data (overfitting) and how well it generalizes to new, unseen data.
Implementation Steps:
- Importing Necessary Libraries: Import Sklearn and other required libraries like Pandas for dataset handling.
- Loading the Dataset: Load the student dataset into a Pandas DataFrame. This dataset typically contains features like student scores, attendance, etc., and the target variable, such as student grades.
- Data Preprocessing: Preprocess the dataset as needed, handling missing values, encoding categorical variables, and scaling numerical features (a minimal sketch follows this list).
- Splitting the Data: Use Sklearn’s train_test_split method to split the dataset into training and testing sets. Specify the test size (e.g., 20% of the data) and optionally set a random state for reproducibility.
- Building and Training the Model: Choose a machine learning algorithm suitable for the task (e.g., regression for predicting grades) and instantiate the model. Train the model using the training data.
- Evaluating Model Performance: Use the testing data to evaluate the model’s performance. Calculate metrics like accuracy, precision, recall, or mean squared error, depending on the problem type (a minimal sketch of this step and the previous one appears after the split walkthrough below).
- Iteration and Optimization: Fine-tune the model parameters, try different algorithms, or explore feature engineering techniques to improve performance.
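Step 3 (Data Preprocessing) is not shown in the walkthrough below, so here is a minimal, hedged sketch of one way to handle a missing value. The tiny inline DataFrame mirrors the first rows of the student dataset used later in this guide; the column names and the mean-imputation strategy are illustrative assumptions, not a prescribed workflow.
# Minimal preprocessing sketch (illustrative only, not the exact workflow used below).
# The inline data mirrors the first rows of the student dataset; mean imputation is an assumption.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "study_hours": [6.83, 6.56, None, 5.67, 8.67],
    "student_marks": [78.50, 76.74, 78.68, 71.82, 84.19],
})

# Fill the missing study_hours value with the column mean
imputer = SimpleImputer(strategy="mean")
df[["study_hours"]] = imputer.fit_transform(df[["study_hours"]])
print(df)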
Let’s see how to implement this in Python.
import pandas as pd
s = pd.read_csv(r"/content/student_info1.csv")
We read the CSV file using Pandas; evaluating s displays the DataFrame.
s
Output:
     study_hours  student_marks
0           6.83          78.50
1           6.56          76.74
2            NaN          78.68
3           5.67          71.82
4           8.67          84.19
..           ...            ...
195         7.53          81.67
196         8.56          84.68
197         8.94          86.75
198         6.60          78.05
199         8.35          83.50
200 rows × 2 columns
x=s.drop(["student_marks"],axis="columns")
y=s["student_marks"]
In the code above, we drop the student_marks column to form x and keep student_marks as y. In other words, x (study_hours) is the independent variable and y (student_marks) is the dependent variable, because student marks depend on study hours.
x
So we can print x to see the output
Output:
     study_hours
0           6.83
1           6.56
2            NaN
3           5.67
4           8.67
..           ...
195         7.53
196         8.56
197         8.94
198         6.60
199         8.35
200 rows × 1 columns
y
Output:
0      78.50
1      76.74
2      78.68
3      71.82
4      84.19
       ...
195    81.67
196    84.68
197    86.75
198    78.05
199    83.50
Name: student_marks, Length: 200, dtype: float64
Now we have separated study_hours and student_marks into the x and y variables.
Next, we split these variables into a train set and a test set using Sklearn.
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=2020)
Now let’s explain the above code step by step:
- from sklearn.model_selection import train_test_split: This line imports a function called train_test_split from the model_selection module of the sklearn library. sklearn (short for scikit-learn) is a popular library in Python used for machine learning tasks. The train_test_split function helps in splitting a dataset into two parts: one for training a machine learning model and the other for testing its performance.
- x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2020): This line actually splits the dataset into four parts:
- x is typically the input features (or independent variables) of the dataset.
- y is usually the target variable (or dependent variable) we want to predict.
- test_size=0.2 specifies that 20% of the dataset will be used for testing the model, and the remaining 80% will be used for training. You can adjust this value to allocate different proportions for training and testing.
- random_state=2020 is a parameter that ensures reproducibility of the split. Setting it to a specific number (in this case, 2020) ensures that every time you run the code, the data will be split in the same way, which can be useful for debugging and sharing results.
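To make test_size and random_state concrete, here is a small illustrative check. It assumes the x, y, and train_test_split import defined above (200 rows in total); the extra variable names are just for this demonstration.
# Illustrative check only (assumes x, y and the train_test_split import from above).
# 1) test_size controls the proportions: 0.2 gives 160/40 rows, 0.3 would give 140/60.
x_train_30, x_test_30, y_train_30, y_test_30 = train_test_split(
    x, y, test_size=0.3, random_state=2020
)
print(len(x_train_30), len(x_test_30))  # 140 60

# 2) Reusing the same random_state reproduces the same split on every run.
x_train_a, _, _, _ = train_test_split(x, y, test_size=0.2, random_state=2020)
x_train_b, _, _, _ = train_test_split(x, y, test_size=0.2, random_state=2020)
print(x_train_a.index.equals(x_train_b.index))  # True: identical rows selected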
x_train
The line above prints the x_train data.
Output:
(the study_hours column for the 160 rows selected for training — the same row indices as y_train below; 160 rows × 1 columns)
y_train
Output:
76     72.08
34     83.08
1      76.74
40     70.27
151    76.70
       ...
195    81.67
118    73.61
67     81.70
136    83.15
96     75.39
Name: student_marks, Length: 160, dtype: float64
So we have printed the train data; now let’s look at the test set data.
x_test
Output:
(the study_hours column for the 40 rows selected for testing — the same row indices as y_test below; 40 rows × 1 columns)
y_test
Output:
199    83.50
119    75.55
186    85.10
27     75.65
57     86.65
145    85.15
161    79.49
169    83.08
23     75.02
191    70.51
130    73.19
28     74.15
42     71.10
177    73.64
154    78.45
164    82.04
108    74.25
115    74.44
87     81.74
89     84.60
95     76.48
92     72.08
70     71.80
35     76.76
143    75.52
41     86.41
104    77.55
175    71.11
5      81.18
163    77.07
181    71.83
2      78.68
117    85.04
68     69.27
140    84.58
101    82.03
11     83.88
88     71.85
120    76.20
196    84.68
Name: student_marks, dtype: float64
Now let’s print the shapes of the train set and the test set.
x_train.shape
Output:
(160, 1)
y_train.shape
Output:
(160,)
x_test.shape
Output:
(40, 1)
y_test.shape
Output:
(40,)
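Steps 5 and 6 from the list at the start (building/training a model and evaluating it) are not shown above, so here is a minimal, hedged sketch of how they might look with these variables. LinearRegression and the mean-imputation of the missing study_hours value are assumptions chosen for illustration, not a required model; the printed metric values depend on the actual data.
# Illustrative sketch of steps 5-6 (model choice and imputation are assumptions).
# Assumes x_train, x_test, y_train, y_test from the split above.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# study_hours contains a missing value (row 2), so fill it with the training mean first
x_train_filled = x_train.fillna(x_train.mean())
x_test_filled = x_test.fillna(x_train.mean())

model = LinearRegression()
model.fit(x_train_filled, y_train)        # train on the 160 training rows

y_pred = model.predict(x_test_filled)     # predict marks for the 40 test rows
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", model.score(x_test_filled, y_test))
Because there is only one feature (study_hours), simple linear regression is a natural first model here, but any regressor with the same fit/predict interface would slot in the same way.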
Conclusion: By following these steps, you’ve learned how to apply the Train-Test Split technique using Sklearn in Python. Understanding data splitting is crucial for building robust and reliable machine learning models. Experiment with different datasets and model configurations to deepen your understanding and enhance your machine learning skills.