The Summer Olympic Games represent a pinnacle of international sports competition, where athletes from around the world compete for glory and national pride. This project aims to predict the medal counts for the USA in the Summer Olympics for the years 2016, 2020, and 2024. Using historical data and machine learning models, the objective is to provide accurate predictions of the number of gold, silver, and bronze medals the USA will secure in these Olympic events.
The primary objective of this project is to develop predictive models that can estimate the medal counts for the USA. By leveraging machine learning algorithms, specifically LinearRegression, RandomForestRegressor, and XGBoostRegression, the goal is to achieve precise predictions. Additionally, feature importance analysis is conducted to identify the most significant predictors of medal success among the available attributes.
- The dataset used for this project is sourced from Kaggle
- Handling missing values by imputing relevant statistics (e.g., mean or median) for numerical columns.
- Dropping non-relevant columns such as 'Athlete', 'City', 'Discipline', and 'Event'.
- Encoding categorical variables using One Hot Encoding for features like event type and athlete nationality.
- Splitting the dataset into training and testing sets using the
train_test_split
function.
- Creating a pipeline with One Hot Encoder and Linear Regression.
- Creating a pipeline with One Hot Encoder and Random Forest Regressor.
- Creating a pipeline with One Hot Encoder and XGBoost Regressor.
Each model was evaluated using metrics such as Mean Squared Error (MSE), R-squared (R²), and Adjusted R-squared.
- Provided a solid baseline with reasonable R-squared values.
- Showed improved performance with a lower Mean Squared Error.
- Achieved the highest accuracy among the evaluated models with the best R-squared value.
- Linear Regression provided a solid baseline with an R-squared value of xxx.
- Random Forest Regression improved the prediction accuracy with an R-squared value of xxx.
- XGBoost Regression achieved the highest accuracy with an R-squared value of xxx.
Feature importance analysis indicated that past performance in specific events is a key predictor of medal counts. The models suggest an increase in the total medal count for the USA in the 2024 Olympics compared to 2016 and 2020.