Njord


Njord attempts to predict future stock prices based on Google Trends data—using machine learning.

About

The data used by Njord is downloaded from Google Trends. The concept for this project came from research by Tobias Preis, Helen Susannah Moat, and H. Eugene Stanley, "Quantifying Trading Behavior in Financial Markets Using Google Trends". That research found that the search volume for certain finance-related terms is linked to the price of the Dow Jones Industrial Average and can, in most cases, predict a dip in the market. The purpose of this project is to combine this research with machine learning.

Project Status

This project is currently under development.

Installation

TODO

Reflection

TODO

Results

Two machine learning algorithms have been explored for this project: XGBoost and MLPClassifier. MLPClassifier clearly performed better than XGBoost. The best annual return achieved by XGBoost was 44.2%, whereas MLPClassifier's best model achieved 91.3% between 2008 and the present. A big contributor to these exceptionally high annual returns was the coronavirus: it caused the stock market to crash, and a crash can be a major source of profit for these algorithms.

MLPClassifier

MLPClassifier performed very well on the test data. It was particularly good at recognising that the small changes in the market between crashes could not be predicted. For the most part it therefore followed a buy-and-hold strategy, but during a stock market crash (like the one caused by the coronavirus) or other, somewhat larger, moves it performed well, as can be seen in Figure 8.

Comparison of the MLPClassifier, 10,000 random simulations, and a buy-and-hold strategy

Figure 8. Comparison of the mean plus and minus 1 standard deviation of 10,000 random simulations, the MLPClassifier algorithm, and a buy-and-hold strategy.
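The random baseline in the figure can be approximated along the following lines. This is only a sketch: the source does not say how a "random simulation" trades, so the rule used here (each day, a coin flip decides whether the strategy is invested or in cash) and the names `simulate_random_strategy` and `random_baseline` are assumptions made for illustration.

```python
import random
from math import sqrt


def simulate_random_strategy(daily_returns, rng):
    """Equity curve of one random strategy, starting from 1.0.

    Assumption: each day the strategy is invested with probability 0.5,
    otherwise it sits in cash and its equity does not change.
    """
    equity = 1.0
    curve = []
    for daily_return in daily_returns:
        if rng.random() < 0.5:
            equity *= 1 + daily_return
        curve.append(equity)
    return curve


def random_baseline(daily_returns, n_simulations=10_000, seed=0):
    """Per-day (mean - std, mean, mean + std) across many random strategies."""
    rng = random.Random(seed)
    curves = [simulate_random_strategy(daily_returns, rng)
              for _ in range(n_simulations)]
    band = []
    for day_values in zip(*curves):  # all simulations on the same day
        mean = sum(day_values) / len(day_values)
        variance = sum((v - mean) ** 2 for v in day_values) / len(day_values)
        std = sqrt(variance)
        band.append((mean - std, mean, mean + std))
    return band
```

A model's equity curve can then be plotted against this band: a curve that stays above the upper edge is doing better than most of the random strategies under this particular trading rule.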

XGBoost

XGBoost did not show the same insight as MLPClassifier. It tried to predict the small changes and ultimately failed at it. However, XGBoost was still able to predict the stock market crash caused by the coronavirus, which is why it still achieved such a large annual return (44.2%).

Comparison of XGBoost, the MLPClassifier, 10,000 random simulations, and a buy-and-hold strategy

Figure 9. Comparison of the mean plus and minus 1 standard deviation of 10,000 random simulations, the MLPClassifier algorithm, the XGBoost algorithm, and a buy-and-hold strategy.

Data

Data Collection

Two datasets were needed for this project: the daily Google Trends data for a specific keyword and the daily stock price data for a specific ticker. To collect the Google Trends data, you have to download every 6-month increment (daily data), every 5-year increment (weekly data), and the full 2004-present range (monthly data) within the 2004-2020 timespan. All of this data is later adjusted so that it is relative across the whole timespan, instead of only within its own increment. To collect the daily stock price data for the ticker you want to predict, download it from a website like Yahoo Finance, which provides historical data for any ticker.
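Keeping track of which 6-month increments still need to be downloaded is easier with an explicit list of date ranges. The sketch below assumes the increments are aligned to calendar half-years (January to June, July to December); the source does not state where the increments start, and `six_month_windows` is a hypothetical helper name.

```python
from datetime import date


def six_month_windows(start_year, end_year):
    """List (first_day, last_day) pairs for each calendar half-year.

    Assumption: increments run January-June and July-December.
    """
    windows = []
    for year in range(start_year, end_year + 1):
        windows.append((date(year, 1, 1), date(year, 6, 30)))
        windows.append((date(year, 7, 1), date(year, 12, 31)))
    return windows


# One (start, end) pair per download for the 2004-2020 timespan.
for start, end in six_month_windows(2004, 2020):
    print(start.isoformat(), end.isoformat())
```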

Data Visualisation

Correlation

To show that there is indeed a relationship between Google Trends data (e.g. 'debt') and stock prices (e.g. the Dow Jones Industrial Average), I plotted the DJIA stock price with markers at the peaks in search volume for "stock market". As you can see, there are usually some peaks before a major stock market crash. There are also some peaks in the middle of a crash, but the peaks before the crash are quite indicative.

DJIA stock price data with peak-indicators of 'stock market'.

Figure 1. The DJIA stock price, with red dots marking the peaks in search volume for "stock market". The graph shows that erratic movement in search volume precedes a major stock market crash.
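Peaks like the ones marked in Figure 1 could be located with a simple rule. The source does not say how a peak is defined, so the sketch below makes an assumption: a peak is a local maximum that also lies more than a chosen number of standard deviations above the mean search volume. The name `find_peaks` and the default threshold are illustrative only.

```python
from statistics import mean, pstdev


def find_peaks(volume, num_std=2.0):
    """Indices of local maxima that exceed mean + num_std * std.

    Assumption: this threshold rule is one plausible definition of a
    "peak"; it is not taken from the project itself.
    """
    threshold = mean(volume) + num_std * pstdev(volume)
    peaks = []
    for i in range(1, len(volume) - 1):
        is_local_max = volume[i - 1] < volume[i] >= volume[i + 1]
        if is_local_max and volume[i] > threshold:
            peaks.append(i)
    return peaks
```

The returned indices can be used to look up the matching dates and draw the markers on top of the price series.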

Adjusted

After all adjustments, the daily data is relative across the entire timespan rather than only within each increment. It looks as follows:

Adjusted daily data over entire timespan.

Figure 2. A graph in which the adjusted daily data is visualised.

Restrictions

All data on Google Trends is relative (0-100) within a single timeframe. You can only get daily data in 6-month increments and weekly data in 5-year increments; for the entire available timespan, only monthly data is provided. Aggregating all the data needed for this project was therefore quite a challenge, and because of these restrictions the result is not completely accurate. However, the method I used was the only way to get daily data over the entire available timespan, which is crucial for this project.

Method

To make all the data relative across the whole timespan, instead of only within each 6-month increment, I had to merge the increments based on the weekly data. However, the weekly data is only available in 5-year increments, so I had to merge those 5-year increments based on the monthly data, which is available for the entire timespan needed for this project. To merge the 6-month and 5-year increments, I computed the percentage change of each data point within its respective increment. Then, for each increment, I took one data point from the next coarser series (weekly for the daily data, monthly for the weekly data) and computed the remaining points by applying the percentage changes to that data point.
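The merging step can be sketched roughly as follows. The code assumes that the anchor taken from the coarser series corresponds to the first day of each increment and that no value inside an increment is zero (a zero would make the percentage change undefined). The function names and the toy numbers are hypothetical.

```python
def pct_changes(values):
    """Fractional change from each point to the next within one increment."""
    return [(curr - prev) / prev for prev, curr in zip(values, values[1:])]


def rescale_increment(values, anchor):
    """Rebuild an increment from one anchor point and its percentage changes.

    Assumption: `anchor` is the coarser-series value for the increment's
    first day. A zero in `values` raises ZeroDivisionError.
    """
    adjusted = [anchor]
    for change in pct_changes(values):
        adjusted.append(adjusted[-1] * (1 + change))
    return adjusted


def merge_increments(increments, anchors):
    """Concatenate increments after rescaling each one to its anchor."""
    merged = []
    for values, anchor in zip(increments, anchors):
        merged.extend(rescale_increment(values, anchor))
    return merged


# Two toy increments, each scaled 0-100 on its own, with anchors taken
# from a (made-up) coarser series.
print(merge_increments([[50, 100, 75], [100, 50, 25]], anchors=[20, 40]))
```

The same function can be applied twice: first to rescale the weekly 5-year increments onto the monthly series, then to rescale the daily 6-month increments onto the adjusted weekly series.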

Example

An example of the search term 'debt' ('debt' is the best search term to predict market change, according to the research mentioned earlier) in the timespan 2007—2009:

Before adjustments

Before adjustments of example.

Figure 3. A graph where the unadjusted relative daily data is visualised. The black vertical lines indicate the edges of the 6-month increments.

After adjustments

After adjustments of example.

Figure 4. A graph where the adjusted relative daily data is visualised. The graph follows the actual weekly data much better.

Weekly

Actual weekly data.

Figure 5. The actual weekly data.

Features

To get better results, the raw data had to be feature engineered. Features used include:

After these features are computed, each of them is shifted by 3 through 10 days. This is because Google Trends data only becomes available three days after the fact, and the target may correlate well with data shifted even further. This yields 272 features. The 50 features that correlate most strongly with the target (according to the Pearson correlation coefficient) are used to train the model and predict the direction of the Dow Jones Industrial Average.
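The shifting and selection steps might look something like the sketch below. It assumes that features are ranked by the absolute value of their Pearson correlation with the target, which the source does not state explicitly. The names `shift`, `lagged_features`, and `top_k` are hypothetical.

```python
from math import sqrt


def shift(values, days):
    """Lag a series: the value at index i comes from index i - days."""
    return ([None] * days + list(values))[:len(values)]


def pearson(xs, ys):
    """Pearson correlation over the positions where both values exist."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / sqrt(var_x * var_y)


def lagged_features(features, lags=range(3, 11)):
    """One shifted copy of every feature for each lag (3 through 10 days)."""
    return {f"{name}_lag{lag}": shift(values, lag)
            for name, values in features.items()
            for lag in lags}


def top_k(features, target, k=50):
    """Names of the k features with the largest absolute correlation."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    return ranked[:k]
```

Shifting by 3 through 10 days produces eight lagged copies of every base feature, so the selection step is what keeps the final feature set small.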

Simple Moving Average Delta

SMA delta.

Figure 6. When this feature becomes more volatile, the close price follows. This is a good indicator for a machine learning algorithm. It can also be seen that the close price percentage change loosely follows the line of the feature.

Bollinger Bands

Bollinger bands.

Figure 7. When the 20-day simple moving average crosses the upper Bollinger band, the close price becomes more volatile. The stock close percentage change also loosely follows the lower Bollinger band.
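For reference, Bollinger Bands are conventionally computed as a simple moving average plus and minus a multiple of the rolling standard deviation. The sketch below uses the 20-day window mentioned in the caption. The band width of two standard deviations and the use of the population standard deviation are common conventions, assumed here rather than taken from the project.

```python
from statistics import mean, pstdev


def bollinger_bands(closes, window=20, num_std=2.0):
    """(lower, middle, upper) band for every full window of close prices.

    Assumptions: population standard deviation and a width of
    `num_std` standard deviations on either side of the moving average.
    """
    bands = []
    for end in range(window, len(closes) + 1):
        recent = closes[end - window:end]
        middle = mean(recent)  # simple moving average of the window
        spread = num_std * pstdev(recent)
        bands.append((middle - spread, middle, middle + spread))
    return bands
```

The first `window - 1` days have no band because there is not yet a full window of prices, so the result is shorter than the input by that many entries.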

Project Organisation

    ├── LICENSE
    ├── Makefile           <- Makefile with commands like `make data` or `make train`
    ├── README.md          <- The top-level README for developers using this project.
    ├── data
    │   ├── processed      <- The final, canonical data sets for modeling.
    │   └── raw            <- The original, immutable data dump.
    │
    ├── docs               <- A default Sphinx project; see sphinx-doc.org for details
    │
    ├── models             <- Trained and serialized models, model predictions, or model summaries
    │
    ├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
    │                         the creator's initials, and a short `-` delimited description, e.g.
    │                         `1.0-jqp-initial-data-exploration`.
    │
    ├── references         <- Data dictionaries, manuals, and all other explanatory materials.
    │
    ├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
    │   └── figures        <- Generated graphics and figures to be used in reporting
    │
    ├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
    │                         generated with `pip freeze > requirements.txt`
    │
    ├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
    └── src                <- Source code for use in this project.
        ├── __init__.py    <- Makes src a Python module
        │
        ├── data           <- Scripts to download or generate data
        │   └── make_dataset.py
        │
        └── features       <- Scripts to turn raw data into features for modeling
            └── build_features.py

MIT License

Copyright (c) 2020 Cristian Perez Jensen