This Jupyter Notebook illustrates a machine learning pipeline for analyzing World Cup match results. The pipeline consists of data loading, data preprocessing, feature selection, model training, hyperparameter tuning, and evaluation of a Random Forest classifier.
Dataset Description
The analysis utilizes two datasets:
- Match Results: Sourced from a CSV file named results.csv, which contains historical match data, including:
  - Date of the match
  - Teams involved (Team_1 and Team_2)
  - Winner of the match
  - Margin of victory
  - Venue (Ground)
- Team Statistics: Obtained from another CSV file named World_cup_2023.csv, which includes:
  - Team names
  - Team rankings
  - Number of titles won
  - Win percentages in One Day Internationals (ODI)
  - World Cup match statistics (matches played, won, lost, etc.)
  - Recent points and ratings
Load the Dataset
The datasets are imported using Pandas:
import pandas as pd
match_results = pd.read_csv('results.csv')
team_statistics = pd.read_csv('World_cup_2023.csv')
Feature Selection
The following features are selected for the machine learning model:
- Encoded team identifiers for both teams
- Team rankings
- Win percentages in ODIs and World Cups
- Recent points and ratings
The target variable is the encoded winner of the match.
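The feature construction described above could be sketched as follows. This is a minimal illustration, not the notebook's actual code: the statistics column names (`Team_name`, `Team_ranking`, `Win_percentage_ODI`) are hypothetical, and the rows are placeholder values rather than real data. Only `Team_1`, `Team_2`, and `Winner` are taken from the dataset description.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Placeholder frames standing in for results.csv and World_cup_2023.csv.
# Teams "A", "B", "C" and all numbers are made up for illustration.
match_results = pd.DataFrame({
    "Team_1": ["A", "B", "C"],
    "Team_2": ["B", "C", "A"],
    "Winner": ["A", "B", "A"],
})
team_statistics = pd.DataFrame({
    "Team_name": ["A", "B", "C"],        # hypothetical column name
    "Team_ranking": [1, 2, 3],           # hypothetical column name
    "Win_percentage_ODI": [60.0, 55.0, 50.0],  # hypothetical column name
})

# Attach each side's statistics, suffixing columns so the two sides stay distinct.
features = match_results.copy()
for side in ("Team_1", "Team_2"):
    stats = team_statistics.add_suffix(f"_{side}")
    features = features.merge(
        stats, left_on=side, right_on=f"Team_name_{side}", how="left"
    )
    features = features.drop(columns=f"Team_name_{side}")

# Use one encoder for all team columns so a team has the same code everywhere.
# Assumes every winner is one of the listed teams (no ties or abandoned matches).
encoder = LabelEncoder().fit(team_statistics["Team_name"])
for col in ("Team_1", "Team_2", "Winner"):
    features[col] = encoder.transform(features[col])

X = features.drop(columns="Winner")  # model inputs
y = features["Winner"]               # encoded target
```

Sharing a single encoder between the team columns and the target keeps the codes consistent, so the model sees the same integer for a team whether it appears as a participant or as the winner.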
Model Training and Evaluation
Train-Test Split
The dataset is split into training and testing sets so that the model’s performance is evaluated on data it was not trained on.
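A split of this kind can be done with Scikit-learn's `train_test_split`. The sketch below uses random placeholder data, and the 80/20 ratio, `random_state=42`, and stratification are assumptions; the split settings actually used in the notebook are not stated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and labels standing in for the real X and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = rng.integers(0, 2, size=100)

# Hold out 20% of the rows for testing (assumed ratio).
# stratify=y keeps class proportions similar in both sets; it requires
# at least two samples of every class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

With many possible winners and few matches per team, stratification may fail for rare classes, in which case dropping `stratify` is a reasonable fallback.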
Machine Learning Pipeline
A machine learning pipeline is created using Scikit-learn, which includes:
- StandardScaler for feature scaling
- RandomForestClassifier for classification
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42)),
])
Hyperparameter Tuning
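GridSearchCV is used to find the best-performing hyperparameters for the Random Forest model within a fixed grid, using 5-fold cross-validation scored by accuracy. The grid covers the number of trees, tree depth, split and leaf size limits, bootstrapping, and the number of features considered per split: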
from sklearn.model_selection import GridSearchCV
# Keys use the 'classifier__' prefix to target the pipeline's Random Forest step
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [5, 10],
    'classifier__min_samples_leaf': [2, 4],
    'classifier__bootstrap': [True, False],
    'classifier__max_features': ['log2', 'sqrt'],
}
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, scoring='accuracy', cv=5, verbose=1)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
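Because an exhaustive grid search trains one model per parameter combination per fold, it can help to count the work before running it. The sketch below does that for the grid above using Scikit-learn's `ParameterGrid`; the variable names `n_candidates` and `n_fits` are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as used for the search.
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [5, 10],
    'classifier__min_samples_leaf': [2, 4],
    'classifier__bootstrap': [True, False],
    'classifier__max_features': ['log2', 'sqrt'],
}

cv_folds = 5
n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 2 * 2 * 2 * 2 combinations
n_fits = n_candidates * cv_folds               # one fit per combination per fold

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
# → 192 candidates, 960 cross-validation fits
```

After fitting, `grid_search.best_params_` and `grid_search.best_score_` report the selected combination and its mean cross-validated accuracy. If the search is too slow, passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel.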
Model Fitting
By default (refit=True), GridSearchCV refits the best configuration on the full training set, so best_model is already fitted and can make predictions on the test set:
y_pred = best_model.predict(X_test)
Model Evaluation
Accuracy and classification report are used for assessing model performance:
from sklearn.metrics import accuracy_score, classification_report

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
This structured documentation provides a clear and concise overview of the analysis of World Cup match results using machine learning, detailing each step of the process from data loading to model evaluation.