This is the final project of Group 2 for the seminar Introduction to Machine Learning (03SMBOEC0385). Our group used Google Colab to simplify collaboration. To load the data, the notebook runs `!git clone` to clone the repository with the data into the Colab environment. If you want to run the Jupyter notebook on your own device instead, you can download the data from this repository and change the file path passed to `pd.read_csv()` to the location of the data on your machine.
- Team Members
- Repository Structure
- Project Overview
- Key Features
- Models Evaluated
- Getting Started
- Results
- Future Work
- Jan Heinrich Schlegel
- Robert Bibaj
- Simon Klaassen
- Thomas Meier
├── Data
│ ├── 2014_Financial_Data.csv
│ ├── 2015_Financial_Data.csv
│ ├── 2016_Financial_Data.csv
│ ├── 2017_Financial_Data.csv
│ └── 2018_Financial_Data.csv
├── Documents
│ ├── ML_in_Finance.pdf
│ └── ML_in_Finance_Presentation.pdf
├── ML_in_Finance_Group2_FinalProject.ipynb
└── README.md
Our project explores the application of various machine learning models to predict stock performance based on historical financial data from 2014 to 2018. The project entails data preprocessing, feature engineering, model selection, and evaluation to address challenges such as class imbalances, missing values, and outliers. We rigorously tested models including Logistic Regression, Gaussian Naive Bayes, Random Forest, XGBoost, Support Vector Machine, and Feedforward Neural Networks to identify the most effective predictors of stock recommendations.
- Data concatenation and preprocessing to handle missing values and outliers.
- Implementation of KNN imputation and Isolation Forest for data cleaning.
- Feature engineering based on the Sustainable Growth Model and financial ratios.
- Evaluation of model performance using the weighted F1-score as the primary metric.
- Handling of class imbalance with techniques such as RandomOverSampler to balance the training data.
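
The data-cleaning steps listed above can be illustrated with a minimal sketch. It assumes scikit-learn is installed and uses synthetic data in place of the financial indicators; the number of neighbours and the contamination rate are illustrative values, not the settings used in the notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic stand-in for the financial indicators (assumption: 200 rows, 4 features).
X = rng.normal(size=(200, 4))
X[:5] += 10.0                               # make a few rows extreme
X[rng.random(X.shape) < 0.05] = np.nan      # knock out roughly 5% of the values

# KNN imputation: each missing value is filled from the 5 most similar rows.
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)

# Isolation Forest: fit_predict returns -1 for suspected outliers and 1 for inliers.
detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(X_imputed)

# Keep only the rows flagged as inliers.
X_clean = X_imputed[labels == 1]
print(f"kept {X_clean.shape[0]} of {X_imputed.shape[0]} rows")
```

When adapting this to real data, fitting the imputer and the outlier detector on the training split only is one way to reduce the risk of data leakage.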
- Logistic Regression
- Gaussian Naive Bayes
- Random Forest
- XGBoost
- Support Vector Machine (SVM)
- Feedforward Neural Networks with multiple hidden layers and dropout
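
As a rough illustration of how such models can be compared on the weighted F1-score, the sketch below trains three of the listed model families on a synthetic, imbalanced dataset. It is not the notebook's code: the data, hyperparameters, and variable names are assumptions, XGBoost, SVM, and the neural network are left out to keep dependencies small, and random oversampling is done with `sklearn.utils.resample` as a stand-in for RandomOverSampler.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.utils import resample

# Synthetic, imbalanced binary problem (assumption: roughly 80% / 20% classes).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Random oversampling on the training split only: duplicate minority rows
# (sampling with replacement) until both classes have the same count.
counts = np.bincount(y_train)
minority = counts.argmin()
n_extra = int(counts.max() - counts.min())
is_minority = y_train == minority
X_extra, y_extra = resample(X_train[is_minority], y_train[is_minority],
                            replace=True, n_samples=n_extra, random_state=0)
X_bal = np.vstack([X_train, X_extra])
y_bal = np.concatenate([y_train, y_extra])

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit on the balanced training data, score on the untouched test split.
scores = {}
for name, model in models.items():
    model.fit(X_bal, y_bal)
    scores[name] = f1_score(y_test, model.predict(X_test), average="weighted")

for name, score in scores.items():
    print(f"{name}: weighted F1 = {score:.3f}")
```

Oversampling is applied after the train/test split so that duplicated rows never appear in the evaluation data.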
To run the analysis:
- Ensure you have Jupyter Notebook or JupyterLab installed.
- Clone the repository and navigate to the project folder.
- Open ML_in_Finance_Group2_FinalProject.ipynb in Jupyter.
- Install the required Python packages listed at the beginning of the Jupyter notebook.
- Execute the notebook cells sequentially to replicate our findings.
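
For a local run, loading the yearly files from the `Data` folder might look like the sketch below. The helper name `load_financial_data` and the added `Year` column are hypothetical and not taken from the notebook; the file names follow the repository structure shown above.

```python
from pathlib import Path

import pandas as pd


def load_financial_data(data_dir, years=range(2014, 2019)):
    """Read one CSV per year from data_dir and stack them into one DataFrame."""
    frames = []
    for year in years:
        path = Path(data_dir) / f"{year}_Financial_Data.csv"
        frame = pd.read_csv(path)
        frame["Year"] = year  # hypothetical column recording the source file
        frames.append(frame)
    # Columns that exist in only some years are filled with NaN by concat.
    return pd.concat(frames, ignore_index=True)


# Example usage, run from the repository root:
# df = load_financial_data("Data")
```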
Our findings indicate that while simple, traditional models such as logistic regression perform adequately, gradient boosting models such as XGBoost can, when carefully tuned, outperform more complex algorithms on tabular data. The project underlines the importance of feature engineering and model selection in financial data analysis.
The project opens avenues for further exploration, such as portfolio performance analysis based on model predictions, extension to non-US stocks, and advanced feature engineering to mitigate data leakage and enhance model accuracy.