Analysis of Cricket World Cup Games and Prediction of the Winner of Each Match
This project is based on result.csv, icc_rankings.csv, and fixtures.csv.
This data set consists of ICC Cricket World Cup 2019 statistics such as matches, teams, past matches, and data on the winning teams since 2010. We use it with a logistic regression algorithm to predict the winning team of each match of the 2019 Cricket World Cup, played in England.
Dataset reference : https://www.kaggle.com/saadhaxxan/cwc-world-cup-2019-prediction-data-set
For preprocessing, rows with missing values and the Ground column were dropped, and the Margin values were scaled with StandardScaler, MinMaxScaler, and RobustScaler.
results.dropna(axis = 0, inplace = True)
results.drop(['Ground'], axis = 1, inplace = True)
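As a minimal sketch of this cleaning step, the snippet below applies the same two calls to a small made-up table. The column names other than Ground and Margin, and all row values, are assumptions for illustration and are not taken from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the results table; every value here is invented.
results = pd.DataFrame({
    "Team_1": ["India", "England", "Australia"],
    "Team_2": ["Pakistan", "India", "England"],
    "Winner": ["India", np.nan, "Australia"],
    "Margin": ["5 wickets", np.nan, "64 runs"],
    "Ground": ["Manchester", "Birmingham", "Lord's"],
})

# Drop every row that contains at least one missing value.
results.dropna(axis=0, inplace=True)
# Drop the Ground column, as in the project code.
results.drop(["Ground"], axis=1, inplace=True)

print(results)
```

Only the two complete rows remain, and the Ground column is gone.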
We wrote helper functions that apply whichever scaler is passed in.
def scaler_team(df_team, df, scaler):
    value = []
    for i in df_team.index:
        value.append(df.loc[i]['Margin'])
    value = np.array(value).reshape(-1, 1)
    scaled = scaler.transform(value)
    return np.mean(scaled)
def scaling(scaler):
    wickets_np = df_wickets['Margin'].to_numpy().reshape(-1, 1)
    scaler.fit(wickets_np)
    .
    .
    .
    return wicket_scaler, runs_scaler
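The fit-then-average pattern used by these helpers can be sketched on toy data as follows. This is an illustration only: it assumes Margin has already been converted to a number, the small df_wickets frame holds invented values, and mean_scaled_margin is a hypothetical name, not a function from the project.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric margins (in wickets) for all matches; values are invented.
df_wickets = pd.DataFrame({"Margin": [2.0, 4.0, 6.0, 10.0]})

def mean_scaled_margin(team_index, df, scaler):
    """Scale the margins of the selected rows with a fitted scaler and return their mean."""
    values = df.loc[team_index, "Margin"].to_numpy().reshape(-1, 1)
    return float(np.mean(scaler.transform(values)))

# Fit the scaler once on every match so all teams share the same scale.
scaler = MinMaxScaler()
scaler.fit(df_wickets["Margin"].to_numpy().reshape(-1, 1))

# Rows 0 and 3 stand in for the matches of one team.
team_rows = [0, 3]
team_score = mean_scaled_margin(team_rows, df_wickets, scaler)
print(team_score)
```

Because the scaler is fitted on all matches and only applied per team, the per-team averages stay comparable with one another. Any scikit-learn scaler with fit and transform can be passed in the same way.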
We used a total of four algorithms: linear regression, logistic regression, random forest, and KNN.
We tried linear regression first, but it reached only about 45% accuracy.
The likely reason is that the target has only two categories, win or lose, while linear regression predicts a continuous value. We therefore switched to logistic regression, which is better suited to a categorical target.
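The difference between the two models on a win/lose target can be illustrated with a short sketch. The data below is synthetic and generated only for this example; it is not the project's feature set, and the scores it produces say nothing about the accuracies reported here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic two-feature data with a binary label (1 = win, 0 = lose).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Linear regression returns unbounded continuous values, not class labels.
linear_out = linear.predict(X)
# Logistic regression returns class labels and probabilities between 0 and 1.
logistic_labels = logistic.predict(X)
logistic_proba = logistic.predict_proba(X)[:, 1]

print("linear output range:", linear_out.min(), linear_out.max())
print("logistic labels:", np.unique(logistic_labels))
print("logistic probability range:", logistic_proba.min(), logistic_proba.max())
```

To score linear regression as a classifier, its output must first be thresholded or rounded, whereas logistic regression models the probability of a win directly.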
Accuracy result: after changing the algorithm, the accuracy increased from about 45% to about 71%.
After that, we wrote a function that finds the combination of scaler and k-fold setting that gives the highest accuracy.
def best_combi(df_teams_minmax, df_teams_standard, df_teams_robust):
    return max(best_kfold_reg(df_teams_minmax),
               best_kfold_reg(df_teams_standard),
               best_kfold_reg(df_teams_robust))
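One way to sketch this kind of search is shown below, using scikit-learn's cross_val_score with KFold over the three scalers. The body of best_kfold_reg is not shown in this README, so this is an assumption about the general approach rather than the project's code; the data is synthetic and the names scalers, scores, and best_name are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic features and binary labels, used only to make the sketch runnable.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 3))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

scalers = {
    "minmax": MinMaxScaler(),
    "standard": StandardScaler(),
    "robust": RobustScaler(),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Mean k-fold accuracy of logistic regression for each scaler.
scores = {}
for name, scaler in scalers.items():
    model = make_pipeline(scaler, LogisticRegression())
    scores[name] = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()

# Keep the scaler with the highest mean accuracy.
best_name = max(scores, key=scores.get)
print(best_name, scores[best_name])
```

Unlike the project code, which scales the team data before the search, this sketch places the scaler inside a pipeline so that it is refitted on the training part of each fold. The number of folds could be searched in the same loop.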
As a result, we found that the combination of MinMaxScaler and k-fold logistic regression gave the highest accuracy, 94%.
We therefore used this combination to predict the result of each game.
Using the icc_rankings and fixtures data sets, we printed the predicted winner of each game and found that 37 of the 45 predictions (about 82%) were correct.
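Counting correct predictions in this way can be sketched as below. The two short lists are invented placeholders rather than the project's actual predictions; only the 37 and 45 figures come from the result above.

```python
# Hypothetical predicted and actual winners for four fixtures (invented values).
predicted = ["England", "India", "Australia", "India"]
actual = ["England", "India", "Australia", "New Zealand"]

# A prediction is correct when it names the actual winner of that fixture.
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(correct, accuracy)

# The same ratio for the figures reported in this project.
reported_accuracy = 37 / 45
print(round(reported_accuracy, 3))
```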