[PROJECT 2 - GROUP 1]

EXECUTIVE SUMMARY: Overview and Objectives

The Summer Olympic Games represent a pinnacle of international sports competition, where athletes from around the world compete for glory and national pride. This project aims to predict the medal counts for the USA in the Summer Olympics for the years 2016, 2020, and 2024. Using historical data and machine learning models, the objective is to provide accurate predictions of the number of gold, silver, and bronze medals the USA will secure in these Olympic events.

The primary objective of this project is to develop predictive models that estimate the medal counts for the USA. Three regressors are compared: LinearRegression and RandomForestRegressor from scikit-learn, and XGBRegressor from XGBoost. Feature importance analysis is then conducted to identify the most significant predictors of medal success among the available attributes.

OVERVIEW OF THE DATA COLLECTION, CLEAN UP AND EXPLORATION PROCESS

The dataset used for this project is sourced from Kaggle.

Preprocessing Steps

Data Cleaning

Handling missing values in numerical columns by imputing a summary statistic (e.g., mean or median).
Dropping non-relevant columns such as 'Athlete', 'City', 'Discipline', and 'Event'.
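
The two cleaning steps above could look roughly like the following sketch. It assumes pandas and a tiny made-up frame; the 'Year', 'Country', and 'Age' columns, the choice of median over mean, and the helper name clean are illustrative assumptions, not the project's actual schema or code.

```python
import pandas as pd

# Made-up rows for illustration only. 'Year', 'Country', and 'Age' are assumed columns.
raw = pd.DataFrame({
    "Year": [2008, 2008, 2012, 2012],
    "Country": ["USA", "USA", "USA", "CHN"],
    "Athlete": ["A", "B", "C", "D"],
    "City": ["Beijing", "Beijing", "London", "London"],
    "Discipline": ["Swimming", "Athletics", "Swimming", "Diving"],
    "Event": ["100m", "200m", "100m", "10m"],
    "Age": [24.0, None, 27.0, 21.0],
})


def clean(df):
    """Fill numeric gaps with each column's median, then drop non-relevant columns."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out.drop(columns=["Athlete", "City", "Discipline", "Event"])


cleaned = clean(raw)
print(cleaned)
```

The median is less sensitive to outliers than the mean, which is one reason it might be preferred for skewed columns.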

Data Encoding

Encoding categorical variables using One Hot Encoding for features like event type and athlete nationality.
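
As a minimal sketch of one-hot encoding with scikit-learn's OneHotEncoder, the example below encodes two hypothetical categorical columns (a country code and a sport). The column contents and the handle_unknown="ignore" setting are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical columns: country code and sport.
X_cat = np.array([
    ["USA", "Swimming"],
    ["CHN", "Diving"],
    ["USA", "Athletics"],
])

# handle_unknown="ignore" keeps transform() from failing on categories unseen during fit.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(X_cat).toarray()

# One output column per category, sorted within each input column.
print(encoder.categories_)
print(encoded)

# An unseen country ("GBR") yields all zeros in the country block.
unseen = encoder.transform(np.array([["GBR", "Swimming"]])).toarray()
print(unseen)
```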

Data Splitting

Splitting the dataset into training and testing sets using the train_test_split function.
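
A minimal sketch of the split, using placeholder arrays. The 80/20 ratio and the fixed random_state are assumptions; the source does not state which values were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (10 rows, 2 columns) and target vector.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```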

Model Training and Evaluation

Linear Regression

Creating a pipeline with One Hot Encoder and Linear Regression.

Random Forest Regressor

Creating a pipeline with One Hot Encoder and Random Forest Regressor.

XGBoost Regressor

Creating a pipeline with One Hot Encoder and XGBoost Regressor.
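
The three pipelines share the same shape, so they can be sketched with one helper. This is an illustrative sketch under assumptions, not the project's code: the toy data, the 'Country' and 'Year' columns, the hyperparameters, and the build_pipeline helper are all hypothetical. XGBRegressor is imported only if the xgboost package is installed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data: one categorical feature, one numeric feature, made-up targets.
X = pd.DataFrame({
    "Country": ["USA", "CHN", "USA", "GBR", "CHN", "USA"],
    "Year": [2004, 2004, 2008, 2008, 2012, 2012],
})
y = [10, 6, 11, 5, 9, 10]


def build_pipeline(model):
    """One-hot encode the categorical column, pass the rest through, then fit the model."""
    preprocess = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Country"])],
        remainder="passthrough",
    )
    return Pipeline([("preprocess", preprocess), ("model", model)])


models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
try:
    # Optional dependency: skipped silently when xgboost is not installed.
    from xgboost import XGBRegressor
    models["xgboost"] = XGBRegressor(n_estimators=50, random_state=0)
except ImportError:
    pass

fitted = {name: build_pipeline(model).fit(X, y) for name, model in models.items()}
predictions = {name: pipe.predict(X) for name, pipe in fitted.items()}
print({name: preds.round(1) for name, preds in predictions.items()})
```

Keeping the encoder inside the pipeline means it is fitted only on the training data passed to fit, which avoids leaking category information from the test set.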

Each model was evaluated using metrics such as Mean Squared Error (MSE), R-squared (R²), and Adjusted R-squared.
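
The three metrics can be computed as in the sketch below. MSE and R² come from scikit-learn; Adjusted R² has no built-in scikit-learn function, so the adjusted_r2 helper here is a hypothetical name implementing the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features. The values are placeholders, not project results.

```python
from sklearn.metrics import mean_squared_error, r2_score


def adjusted_r2(r2, n_samples, n_features):
    """Penalize R-squared for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Placeholder values, not project results.
y_true = [3, 5, 7, 9]
y_pred = [2.5, 5.5, 6.5, 9.5]

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
adj = adjusted_r2(r2, n_samples=len(y_true), n_features=1)

print(f"MSE={mse:.3f}  R2={r2:.3f}  Adjusted R2={adj:.3f}")
```

Note that with one-hot encoded inputs, the feature count p is the number of columns after encoding, which can be much larger than the number of raw columns.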

Observations and Results

Linear Regression

Provided a solid baseline with reasonable R-squared values.

Random Forest Regressor

Showed improved performance with a lower Mean Squared Error.

XGBoost Regressor

Achieved the highest accuracy among the evaluated models with the best R-squared value.

Results and Conclusions

Linear Regression provided a solid baseline with an R-squared value of xxx.
Random Forest Regression improved the prediction accuracy with an R-squared value of xxx.
XGBoost Regression achieved the highest accuracy with an R-squared value of xxx.

Feature importance analysis indicated that past performance in specific events is a key predictor of medal counts. The models suggest an increase in the total medal count for the USA in the 2024 Olympics compared to 2016 and 2020.
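
The source does not say how feature importance was computed. One common approach, shown here purely as an assumed illustration, is to read the impurity-based feature_importances_ attribute of a fitted RandomForestRegressor. The data is synthetic and the feature names past_medals and noise_feature are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200

# Synthetic data: the target depends on past_medals only; noise_feature is irrelevant.
past_medals = rng.uniform(0, 50, n)
noise_feature = rng.uniform(0, 50, n)
X = np.column_stack([past_medals, noise_feature])
y = 2.0 * past_medals + rng.normal(0, 1, n)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
importances = dict(zip(["past_medals", "noise_feature"], forest.feature_importances_))
ranked = sorted(importances, key=importances.get, reverse=True)

for name in ranked:
    print(f"{name}: {importances[name]:.3f}")
```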
