This project publishes (tweets) daily forecasts from 7 models for stocks in the financial market. Users can see the predictions and request predictions for the companies they want.
Some of the main technologies and packages used:
- Web scraping with pandas (a minimal sketch follows this list)
- Machine learning with sklearn and scikit-optimize
- Twitter bot built on the Tweepy API
- Deployment: PythonAnywhere (I'm looking for another online hosting platform)
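As a taste of the scraping step, here is a minimal sketch of pulling a price-history table with pandas. The URL and column handling are assumptions for illustration and may differ from what regress_bot.py actually does.

```python
# Minimal scraping sketch, assuming the ticker's history page exposes an
# HTML table that pandas can parse; illustrative URL, not the bot's exact one.
import pandas as pd

def fetch_history(ticker: str) -> pd.DataFrame:
    url = f"https://finance.yahoo.com/quote/{ticker}/history"  # assumed URL
    tables = pd.read_html(url)  # returns every <table> on the page
    return tables[0]  # first table: date, open, high, low, close, volume
```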
You can see the final project here: the Twitter account of Regress.
- regress_bot.py and funcs.py: the Python program that runs on PythonAnywhere and contains the bot code and the model training; funcs.py holds helper functions used by the regress_bot.py program.
- companies.txt and last-mention-id.txt: examples of the text files the Regress Bot uses to store the companies it predicts and tweets about, and the ID of the last tweet that mentioned its account (a sketch of how these files drive the bot follows this list).
- report.csv: example of a report that the Regress Bot updates every day.
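To make the role of these files concrete, here is a hypothetical sketch (not the actual bot code) of how they could drive the mention-handling loop, using Tweepy's v3-style API. The credential placeholders and the ticker-parsing rule are illustrative assumptions.

```python
# Hypothetical sketch: read mentions newer than the last handled one and
# append requested tickers to companies.txt. Credentials are placeholders.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

with open("last-mention-id.txt") as f:
    last_id = int(f.read().strip())

# Fetch only mentions newer than the last one already handled
for mention in api.mentions_timeline(since_id=last_id):
    # Assumed request format, e.g. "@RegressML predict AAPL"
    ticker = mention.text.split()[-1].upper()
    with open("companies.txt", "a") as f:
        f.write(ticker + "\n")
    last_id = max(last_id, mention.id)

with open("last-mention-id.txt", "w") as f:
    f.write(str(last_id))
```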
The program trains 7 models every day, tuning their hyperparameters: Stochastic Gradient Descent, Ridge Regression, Linear Support Vector Regressor, K-Nearest Neighbors, Random Forest, AdaBoost, and MLP. The models are trained on the last 30 days of data and tested on the last 5; the best 3, ranked by RMSE (Root Mean Squared Error), are chosen to tweet their predictions (a sketch of this selection step follows the table below). From what I have observed, Linear SVR and SGD are the best ones.
| Model | Tuned Hyperparameters |
|---|---|
| Stochastic Gradient Descent | Penalty, Alpha, and Learning Rate |
| Ridge Regression | Regularization |
| Linear Support Vector Regression | Regularization |
| Regression based on k-NN | Number of Neighbors and Weights |
| Random Forest Regressor | Number of Trees |
| AdaBoost | Number of Estimators and Learning Rate |
| Multi-layer Perceptron Regressor | Activation Function and Learning Rate |
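Putting the pieces together, the sketch below shows one way the daily tune-and-select step could look with scikit-optimize's BayesSearchCV: tune each model, score it by RMSE on the held-out last 5 days, and keep the best 3. The search spaces and iteration counts are illustrative, not the bot's actual settings, and only 3 of the 7 models are spelled out.

```python
# Condensed sketch of the daily training/selection step (not the exact code
# in regress_bot.py). Search spaces below are illustrative assumptions.
import numpy as np
from skopt import BayesSearchCV
from skopt.space import Real, Categorical
from sklearn.linear_model import SGDRegressor, Ridge
from sklearn.svm import LinearSVR
from sklearn.metrics import mean_squared_error

models = {
    "SGD": (SGDRegressor(), {
        "penalty": Categorical(["l2", "l1", "elasticnet"]),
        "alpha": Real(1e-6, 1e-1, prior="log-uniform"),
        "learning_rate": Categorical(["constant", "optimal", "invscaling"]),
    }),
    "Ridge": (Ridge(), {"alpha": Real(1e-3, 1e2, prior="log-uniform")}),
    "LinearSVR": (LinearSVR(), {"C": Real(1e-3, 1e2, prior="log-uniform")}),
    # ...the other four models follow the same pattern
}

def pick_best_three(X, y):
    # X, y hold the last 30 days; hold out the last 5 days for testing
    X_train, X_test = X[:-5], X[-5:]
    y_train, y_test = y[:-5], y[-5:]
    scores = {}
    for name, (estimator, space) in models.items():
        search = BayesSearchCV(estimator, space, n_iter=20, cv=3)
        search.fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, search.predict(X_test)))
        scores[name] = (rmse, search.best_estimator_)
    # lowest RMSE first; the top 3 get to tweet their predictions
    return sorted(scores.items(), key=lambda kv: kv[1][0])[:3]
```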
As the data is a time series, the ideal approach would be to train models like ARIMA or Prophet (from Facebook).
I chose to use more "classic" models because I wanted to see how they would perform; as a next step, it would be worth seeing how dedicated time series models would predict. Furthermore, since the program is hosted on a free platform, PythonAnywhere, it is impractical to train models on a lot of data, which is why I opted for 30 days. Interestingly, training the models on more days (I tried up to 5 years) increased their RMSE. Maybe, with more recent data, the models learn stronger relations between the features and the label, since recent data better reflects today's data.
Training an MLP (Multi-Layer Perceptron) on such a small dataset may be disproportionate, but I wanted to see how it would perform; it seems to be a good model nonetheless.
More details on model training and selection are in the regress_bot.py and funcs.py files. The models are trained every day with new data scraped from the Yahoo Finance website, their hyperparameters are tuned with scikit-optimize, and their predictions are posted on the @RegressML account on Twitter.
My Data Science portfolio: link
My LinkedIn: link
Bruno Kenzo, 18 yo.