This Jupyter Notebook illustrates a machine learning pipeline for analyzing Cricket World Cup match results. The pipeline consists of data loading, data preprocessing, feature selection, model training, hyperparameter tuning, and evaluation of a Random Forest classifier.
Dataset Description
The analysis utilizes two datasets:
- Match Results: Sourced from a CSV file named results.csv, which contains historical match data, including:
  - Date of the match
  - Teams involved (Team_1 and Team_2)
  - Winner of the match
  - Margin of victory
  - Venue (Ground)
- Team Statistics: Obtained from another CSV file named World_cup_2023.csv, which includes:
  - Team names
  - Team rankings
  - Number of titles won
  - Win percentages in One Day Internationals (ODI)
  - World Cup match statistics (matches played, won, lost, etc.)
  - Recent points and ratings
Load the Dataset
The datasets are imported using Pandas:

    import pandas as pd

    match_results = pd.read_csv('results.csv')
    team_statistics = pd.read_csv('World_cup_2023.csv')

Feature Selection
The following features are selected for the machine learning model:
- Encoded team identifiers for both teams
- Team rankings
- Win percentages in ODIs and World Cups
- Recent points and ratings
The target variable is the encoded winner of the match.
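
As a rough sketch of how such a feature table might be assembled, the snippet below label-encodes team names and joins each team's statistics onto the match rows. The small inline frames stand in for results.csv and World_cup_2023.csv with made-up values. The column names `Winner`, `Team_name`, `Team_ranking`, and `Win_percentage_ODI`, and the helper `build_features`, are assumptions for illustration; the notebook's actual names may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the two CSV files; all values are made up.
match_results = pd.DataFrame({
    "Team_1": ["India", "Australia", "England"],
    "Team_2": ["Australia", "England", "India"],
    "Winner": ["India", "Australia", "India"],        # assumed column name
})
team_statistics = pd.DataFrame({
    "Team_name": ["India", "Australia", "England"],   # hypothetical column
    "Team_ranking": [1, 2, 3],                        # hypothetical column
    "Win_percentage_ODI": [60.0, 58.0, 55.0],         # hypothetical column
})

# Fit one encoder on every team name so that both team columns and the
# winner share the same integer codes.
encoder = LabelEncoder()
encoder.fit(team_statistics["Team_name"])


def build_features(matches, stats, enc):
    """Return (X, y): encoded teams plus per-team statistics, and the encoded winner."""
    stats = stats.set_index("Team_name")
    X = pd.DataFrame({
        "team_1_id": enc.transform(matches["Team_1"]),
        "team_2_id": enc.transform(matches["Team_2"]),
    })
    # Look up every statistic for each side of the match by team name.
    for side in ("Team_1", "Team_2"):
        for col in stats.columns:
            X[f"{side}_{col}"] = matches[side].map(stats[col]).to_numpy()
    y = enc.transform(matches["Winner"])
    return X, y


X, y = build_features(match_results, team_statistics, encoder)
```

Fitting a single encoder on the full list of team names keeps the codes consistent across `Team_1`, `Team_2`, and the target, which would not be guaranteed if each column were encoded separately.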
Model Training and Evaluation
Train-Test Split
The dataset is split into training and testing sets so that the model’s performance can be evaluated on data it was not trained on.
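
The split itself is not shown in this description. A minimal sketch using scikit-learn's `train_test_split` might look like the following; the 80/20 ratio, `random_state=42`, the use of `stratify`, and the synthetic `X` and `y` are all assumptions for illustration rather than values taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the encoded winners.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = rng.integers(0, 2, size=100)

# Hold out 20% of the matches for testing (assumed ratio).
# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

Stratification raises an error when a class has fewer than two samples, so it may need to be dropped if some teams appear as winners only once.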
Machine Learning Pipeline
A machine learning pipeline is created using Scikit-learn, which includes:
- StandardScaler for feature scaling
- RandomForestClassifier for classification

    from sklearn.pipeline import Pipeline
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import StandardScaler

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', RandomForestClassifier(random_state=42))
    ])

Hyperparameter Tuning
GridSearchCV is used to find the optimal hyperparameters for the Random Forest model. The parameters tuned include:

    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'classifier__n_estimators': [50, 100, 200],
        'classifier__max_depth': [None, 10, 20, 30],
        'classifier__min_samples_split': [5, 10],
        'classifier__min_samples_leaf': [2, 4],
        'classifier__bootstrap': [True, False],
        'classifier__max_features': ['log2', 'sqrt']
    }

    grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid,
                               scoring='accuracy', cv=5, verbose=1)
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_

Model Fitting
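
With scikit-learn's default `refit=True`, `GridSearchCV` refits the best configuration on the whole training set, so `best_estimator_` needs no further `fit` call. The sketch below illustrates this on synthetic data; `make_classification` and the reduced `small_grid` are assumptions chosen only to keep the example fast, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the match features (assumption).
X_train, y_train = make_classification(n_samples=120, n_features=6, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Reduced grid so the sketch runs quickly. The full grid in the
# Hyperparameter Tuning section has 3 * 4 * 2 * 2 * 2 * 2 = 192
# candidates, which means 960 fits with cv=5.
small_grid = {
    "classifier__n_estimators": [20, 50],
    "classifier__max_depth": [None, 10],
}
n_candidates = len(ParameterGrid(small_grid))

grid_search = GridSearchCV(pipeline, small_grid, scoring="accuracy", cv=5)
grid_search.fit(X_train, y_train)

# best_estimator_ has already been refit on all of X_train,
# so it can be used for prediction straight away.
best_model = grid_search.best_estimator_
print(grid_search.best_params_, round(grid_search.best_score_, 3))
```

Printing `best_params_` and `best_score_` is a quick way to check which configuration was selected and its mean cross-validated accuracy before moving on to the test set.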
The model is fitted to the training data, and predictions are made on the test set:

    y_pred = best_model.predict(X_test)

Model Evaluation
Accuracy and a classification report are used to assess model performance:

    from sklearn.metrics import accuracy_score, classification_report

    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))

This structured documentation provides a clear and concise overview of the analysis of World Cup match results using machine learning, detailing each step of the process from data loading to model evaluation.