World Cup Cricket Prediction Machine Learning Model

This Jupyter Notebook illustrates a machine learning pipeline for analyzing World Cup match results. The pipeline consists of data loading, data preprocessing, feature selection, model training, hyperparameter tuning, and evaluation of a Random Forest classifier.

Dataset Description

The analysis utilizes two datasets:

Match Results: Sourced from a CSV file named results.csv, which contains historical match data, including:
- Date of the match
- Teams involved (Team_1 and Team_2)
- Winner of the match
- Margin of victory
- Venue (Ground)

Team Statistics: Obtained from another CSV file named World_cup_2023.csv, which includes:
- Team names
- Team rankings
- Number of titles won
- Win percentages in One Day Internationals (ODI)
- World Cup match statistics (matches played, won, lost, etc.)
- Recent points and ratings

Load the Dataset

The datasets are imported using Pandas:

import pandas as pd 
match_results = pd.read_csv('results.csv') 
team_statistics = pd.read_csv('World_cup_2023.csv')

Feature Selection

The following features are selected for the machine learning model:

- Encoded team identifiers for both teams
- Team rankings
- Win percentages in ODIs and World Cups
- Recent points and ratings

The target variable is the encoded winner of the match.
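
As a minimal sketch of how such a feature table might be assembled, the example below attaches per-team statistics to each match and label-encodes the team names. The tiny DataFrames and the column names Team_name, Team_ranking, and Win_percentage_ODI are assumptions made for illustration, as is the helper build_features; the actual CSV headers and the notebook's own preprocessing may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for results.csv and World_cup_2023.csv.
# All column names below are assumptions, not the real CSV headers.
match_results = pd.DataFrame({
    "Team_1": ["India", "Australia", "England", "India"],
    "Team_2": ["Australia", "England", "India", "England"],
    "Winner": ["India", "Australia", "India", "England"],
})
team_statistics = pd.DataFrame({
    "Team_name": ["India", "Australia", "England"],
    "Team_ranking": [1, 2, 3],
    "Win_percentage_ODI": [52.0, 60.0, 50.0],
})


def build_features(matches, stats):
    """Hypothetical helper: join team stats onto each match and encode names."""
    df = matches.copy()
    for side in ("Team_1", "Team_2"):
        # Prefix the stat columns so both teams' stats can sit in one row.
        side_stats = stats.add_prefix(f"{side}_")
        df = df.merge(side_stats, left_on=side,
                      right_on=f"{side}_Team_name", how="left")
        df = df.drop(columns=f"{side}_Team_name")

    # One encoder for all three columns so a team maps to the same integer
    # whether it appears as Team_1, Team_2, or Winner.
    encoder = LabelEncoder().fit(stats["Team_name"])
    for col in ("Team_1", "Team_2", "Winner"):
        df[col + "_enc"] = encoder.transform(df[col])

    X = df.drop(columns=["Team_1", "Team_2", "Winner", "Winner_enc"])
    y = df["Winner_enc"]
    return X, y, encoder


X, y, encoder = build_features(match_results, team_statistics)
print(X.columns.tolist())
print(y.tolist())
```

Sharing a single encoder across the team and winner columns keeps the integer codes consistent, and encoder.inverse_transform can map predictions back to team names.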

Model Training and Evaluation

Train-Test Split

The dataset is split into training and testing sets (X_train, X_test, y_train, y_test) so that the model's performance can be evaluated on matches it has not seen during training.
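
The split itself is not shown here, so the following is only a sketch of one common way to do it with Scikit-learn's train_test_split. The synthetic X and y, the 80/20 ratio, random_state=42, and the use of stratify are all assumptions, not values taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and encoded winners.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 6))      # 100 matches, 6 features (assumed)
y = rng.integers(0, 2, size=100)   # two classes only, for illustration

# Hold out 20% of the matches for testing; stratify keeps the class
# proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

With the real data, where the target has one class per winning team, stratify=y raises an error if any team appears only once, so it may need to be dropped or rare teams handled separately.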

Machine Learning Pipeline

A machine learning pipeline is created using Scikit-learn, which includes:

- StandardScaler for feature scaling
- RandomForestClassifier for classification

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Chain scaling and classification so both are fitted together
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

Hyperparameter Tuning

GridSearchCV is used to search for the best-performing Random Forest hyperparameters, using 5-fold cross-validation scored by accuracy. The tuned parameters and their candidate values are listed in param_grid:

from sklearn.model_selection import GridSearchCV

# The 'classifier__' prefix routes each parameter to the pipeline's classifier step
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [5, 10],
    'classifier__min_samples_leaf': [2, 4],
    'classifier__bootstrap': [True, False],
    'classifier__max_features': ['log2', 'sqrt']
}

grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid,
                           scoring='accuracy', cv=5, verbose=1)
grid_search.fit(X_train, y_train)

# Pipeline refitted on the full training set with the best parameters
best_model = grid_search.best_estimator_
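
To get a feel for the cost of this search, the sketch below counts the candidate combinations in the grid with Scikit-learn's ParameterGrid and multiplies by the number of cross-validation folds. It only counts; it does not train any model.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [5, 10],
    'classifier__min_samples_leaf': [2, 4],
    'classifier__bootstrap': [True, False],
    'classifier__max_features': ['log2', 'sqrt'],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 2 * 2 * 2 * 2
n_fits = n_candidates * 5                      # one fit per fold (cv=5)

print(n_candidates, n_fits)  # 192 960
```

After fitting, grid_search.best_params_ and grid_search.best_score_ report the selected combination and its mean cross-validated accuracy.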

Model Fitting

By default, GridSearchCV refits the best parameter combination on the full training set, so best_model is already fitted and can make predictions on the test set directly:

y_pred = best_model.predict(X_test)

Model Evaluation

Accuracy and a classification report (per-class precision, recall, and F1-score) are used to assess model performance:

from sklearn.metrics import accuracy_score, classification_report

print("Accuracy:", accuracy_score(y_test, y_pred))

print("Classification Report:\n", classification_report(y_test, y_pred))
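
As a small worked example of what these two metrics measure, the sketch below applies them to five hand-made labels. The labels and the team names passed as target_names are invented for illustration and are not results from the notebook.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical encoded winners: 0 = Australia, 1 = England, 2 = India.
y_true = [0, 1, 2, 2, 1]
y_hat = [0, 1, 2, 1, 1]   # one India win mispredicted as England

acc = accuracy_score(y_true, y_hat)  # 4 of 5 correct

# output_dict=True returns nested dicts instead of a formatted string,
# which is convenient for pulling out individual per-class numbers.
report = classification_report(
    y_true, y_hat,
    target_names=["Australia", "England", "India"],
    output_dict=True,
)

print(acc)                        # 0.8
print(report["India"]["recall"])  # 0.5
```

Accuracy alone hides which teams are mispredicted; the per-class recall shows that only half of the India wins in this toy sample were recovered.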

In summary, the notebook walks through each step of the analysis of World Cup match results, from data loading to model evaluation.