
Predicting Stock Data with an LSTM Network

Author: Sam Stoltenberg

Abstract

Do you want to GET RICH QUICK using data science and machine learning to predict stock prices? Going this route involves many heartaches and decision trees to traverse, such as:

  • Drop some sparse columns, or fill the missing data with 0?
  • Do you one-hot encode an analyst rating such as "BUY" or "HOLD", or map a numerical value onto each of the given features?
  • What if there are irregularities in the price data you have? How do you fix them?
All of these and more are valid questions we had to answer, and each has a direct effect on the predictions of our networks. If in the end our predictions do generate profit, we become market movers, and eventually the market will work out the methods we are using to predict.

Our data was scraped using Selenium from an investment firm and consists of analyst opinions, performance statistics, prices, and company information for 7000+ stock symbols from August 9th, 2019 to the present. After cleaning the data and dropping irregularities, we end up with roughly 2000 symbols.

We are predicting time series data, so we have to define things such as the number of days used to predict the next. The data is then transformed into multiple matrices of X_data correlating to y_targets, the X_data being all the data from the n day(s) before and the y_targets being the data we are trying to predict. If one wanted to know the information two days ahead, one would either have to predict all the data for one day and then use the predicted data to predict the next, or structure the data in such a way that one day is used to predict two.
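
As a rough illustration of that windowing step, here is a minimal sketch using keras' TimeseriesGenerator; the array sizes and n_input value are placeholders, not the values we actually tuned.

import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

# Toy feature matrix: one row per trading day, one column per feature
data = np.random.rand(100, 8)        # 100 days, 8 features
targets = data[:, 0]                 # say we are predicting the first column

n_input = 3                          # days of history used per prediction
gen = TimeseriesGenerator(data, targets, length=n_input, batch_size=32)

# Each batch pairs (batch, n_input, n_features) windows with next-day targets
X_batch, y_batch = gen[0]
print(X_batch.shape, y_batch.shape)  # (32, 3, 8) (32,)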

Our first networks had infinite loss due to predicting everything as 0, so we had to devise a method for creating the best network to use on the data. There are few `plug-and-play` methods for tuning neural networks, especially for tuning time series predicting networks. The method we did find was the Hyperband from kerastuner. The Hyperband takes a build function, and inside of the build function one can use a Hyperband choice function, which reports back to the Hyperband what effect a given choice had on the validation loss of the network. Our NetworkTuner can tune items such as:

  • n_input (number of days to use in the prediction)
  • Columns (which of the given columns to use in the prediction)
  • Whether or not to scale the data between 0 and 1
  • How many neurons to use in any given layer
  • How much regularization if any to use, and which type

With the Hyperband we also developed a cross-validation method, as kerastuner does not supply one out of the box for time series. Cross validation ensures that the parameters are not being tuned solely for one set of testing data. K validation sets are also held back throughout the tuning process to test the network at the end of tuning.
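
The real splitting lives inside NetworkTuner; the sketch below only illustrates the idea under simple assumptions (contiguous, un-shuffled folds, each split again into train/test/validation slices), and none of the names are the actual API.

import numpy as np

def time_series_folds(data, k_folds=5, test_frac=0.2, val_frac=0.2):
    """Carve a series into k contiguous folds, each with its own
    train/test/validation slices (order preserved, no shuffling)."""
    folds = []
    for chunk in np.array_split(data, k_folds):
        n = len(chunk)
        n_val = int(n * val_frac)
        n_test = int(n * test_frac)
        train = chunk[: n - n_test - n_val]
        test = chunk[n - n_test - n_val: n - n_val]
        val = chunk[n - n_val:]
        folds.append((train, test, val))
    return folds

# Example: 5 folds over 200 days of (toy) data
folds = time_series_folds(np.arange(200), k_folds=5)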

Using the OSEMN Process

  • Obtain the data
  • Scrub the data
  • Explore the data
  • Model the data
  • Interpret the data
  • Reference

This repository contains

  • A Jupyter notebook main.ipynb detailing my EDA and neural network processes

  • A technical presentation.pdf of the project.

  • A python script tune.py which is run from a shell with:

python tune.py [name_of_network]
# name_of_network being a predefined name correlating to a 
# function that tunes that specific network on predefined 
# hyper-parameters
  • A Jupyter notebook Pull and clean data.ipynb for pulling and replacing all the pickles of data, refreshing data in the 'stock_cleaned' SQL server, and refreshing the Firebase database.

  • Folder db with files firebase.py and database.py for connecting to and posting to Google Firebase and our SQL server.

  • Folder modeling with files:

    • build.py with class NetworkBuilder which takes parameters that directly correlate to how a network is put together. This class is also used for tuning those same parameters.

    • create.py with class NetworkCreator that does everything from preparing the time series data to creating an html report on how well the model performed on the train, test, and validation data.

    • sequential.py with class CustomSequential for wrapping a keras Sequential model and overriding its fit function to implement a custom cross validation method.

    • tuner.py with class NetworkTuner for tuning a neural network's architecture, and data processing methods.

  • Folder old (unorganized) with files:

    • Old Modeling.ipynb, which is a Jupyter Notebook where I failed to predict on all the data

    • Old main.ipynb, which is my original Jupyter Notebook containing the scrubbing process and attempts at modeling

    • Old main2.ipynb, which is a Jupyter Notebook showing my attempt at predicting all the data from three sides before realizing it was impossible with my single GPU, and that company info is irrelevant there since it is unchanging.

    • Pull and update data.ipynb, which is an almost-working notebook for updating the data rather than pulling it all and updating everything.

    • scratch.ipynb, which is a Jupyter Notebook showcasing where I really dug into time series data, exactly what the generator was doing, and forecasting.

  • Folder reports containing HTML reports of how each model performed, and which columns directly affected its performance.

  • File styles/custom.css containing the css used to style the jupyter notebooks

  • Folder test_notebooks (unorganized) with files:

    • Firebase Test.ipynb, which is a Jupyter Notebook

    • Prediction_testing.ipynb, which is a Jupyter Notebook testing predictions with my old method of Greek-god-named models.

    • dashboard_test.ipynb, which is a Jupyter Notebook with my first tests of plotly graphs and my scraped data for my website

    • model_scratch_testing.ipynb, which is a Jupyter Notebook containing the actual function tests that were used in the beginning development of my NetworkCreator class.

Scrubbing the data

Prices

  • Reindex to valid dates from 2019-08-09 onwards, excluding the four days that have very little data: '2019-12-09', '2020-06-23', '2020-06-24', '2020-06-25'
  • Forward interpolate the data with a limit of three days, so if 6-25 had a valid price and the four days after it were null, it would fill the first three but not the fourth (see the sketch after this list)
  • Drop symbols with null values
  • Post to stock_cleaned SQL server
  • Pickle
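
A rough pandas sketch of the reindex / forward-interpolate / drop steps described above; the toy prices frame and the exact calls are illustrative, not the notebook code.

import numpy as np
import pandas as pd

# Toy prices frame: one row per scraped day, one column per symbol
scraped_dates = pd.date_range('2019-08-09', '2020-10-30', freq='B')
prices = pd.DataFrame(np.random.rand(len(scraped_dates), 2),
                      index=scraped_dates, columns=['AAPL', 'VLO'])

# Reindex to the valid dates, excluding the four near-empty days
valid_dates = scraped_dates.drop(
    pd.to_datetime(['2019-12-09', '2020-06-23', '2020-06-24', '2020-06-25']),
    errors='ignore')
prices = prices.reindex(valid_dates)

# Forward interpolate gaps of at most three consecutive days
prices = prices.interpolate(limit=3, limit_direction='forward')

# Drop any symbol that still has null values
prices = prices.dropna(axis=1)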

Splits

  • Make an apply column which is `num`/`den` (see the sketch below)
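
A small illustration of that ratio column; the toy frame below only follows the wording above and is not the real splits table.

import pandas as pd

# Toy splits table: a 4-for-1 split has num=4 and den=1
splits = pd.DataFrame({
    'symbol': ['AAPL'],
    'date': pd.to_datetime(['2020-08-31']),
    'num': [4],
    'den': [1],
})

# The factor later used to adjust prices around the split date
splits['apply'] = splits['num'] / splits['den']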

Performance

  • Load in performance and clean
  • Drop symbols not in price symbols
  • Match index to price index
  • Fill null ExDividend dates with 1970-01-01, then encode them as days since then so the column is numerical (see the sketch after this list)
  • Decide which columns to fill, and which columns to fill and then drop if the symbol still has null values
  • Interpolate null values for both groups, and fill NA for the fill-only columns
  • Drop columns with a negative min that still have many null values
  • Drop symbols that still have null values in the columns with a negative minimum, as filling with 0 would not be adequate
  • Add price to performance
  • Apply splits
  • Separate out penny stocks (stocks where price is < 1 dollar)
  • Post to stock_cleaned SQL server
  • Pickle penny and non-penny performances
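
A rough sketch of the ExDividend encoding; the column name and toy frame are illustrative.

import pandas as pd

# Toy performance slice with a nullable ex-dividend date column
performance = pd.DataFrame({
    'ExDividendDate': pd.to_datetime(['2020-08-07', None, '2020-05-08']),
})

# Fill missing dates with the epoch, then encode as days since 1970-01-01
epoch = pd.Timestamp('1970-01-01')
ex_div = performance['ExDividendDate'].fillna(epoch)
performance['ExDividendDate'] = (ex_div - epoch).dt.days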

Company

  • Split out symbols that are in performance symbols
  • Fill null text values with `unknown`
  • Pickle.

Analyst

  • Interpolate null values by symbol, then fill the rest with 0 (see the sketch after this list)
  • Map text values to numeric
  • Convert all to float
  • Post to stock_cleaned SQL server
  • Pickle
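
A rough sketch of the by-symbol interpolation and the text-to-numeric mapping; the toy frame, column names, and rating scale are illustrative, not the values we actually used.

import numpy as np
import pandas as pd

# Toy analyst frame: a text rating plus a sparse numeric column per symbol
analyst = pd.DataFrame({
    'symbol': ['AAPL', 'AAPL', 'AAPL', 'VLO', 'VLO', 'VLO'],
    'Rating': ['BUY', 'HOLD', 'BUY', 'SELL', None, 'HOLD'],
    'TargetPrice': [120.0, np.nan, 130.0, 60.0, 58.0, np.nan],
})

# Interpolate numeric gaps within each symbol, then fill what is left with 0
num_cols = analyst.select_dtypes('number').columns
analyst[num_cols] = (analyst.groupby('symbol')[num_cols]
                            .transform(lambda col: col.interpolate()))
analyst[num_cols] = analyst[num_cols].fillna(0)

# Map the text rating to a number so every feature column is a float
rating_map = {'SELL': -1.0, 'HOLD': 0.0, 'BUY': 1.0}
analyst['Rating'] = analyst['Rating'].map(rating_map).fillna(0)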

Combined Company/Analyst/Performance

  • One hot encode Company (see the sketch below)
  • Combine the three data frames into one
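
A rough sketch of the encode-and-combine step; the tiny frames below stand in for the real company, analyst, and performance frames.

import pandas as pd

# Toy frames indexed by symbol (the real frames are much wider)
company = pd.DataFrame({'Sector': ['Technology', 'Energy']},
                       index=['AAPL', 'VLO'])
analyst_perf = pd.DataFrame({'PE': [35.0, 9.0], 'Price': [120.0, 60.0]},
                            index=['AAPL', 'VLO'])

# One hot encode the company text columns, then combine the frames into one
combined = pd.concat([pd.get_dummies(company), analyst_perf], axis=1)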

After the process is complete, we update Firebase for the website with performance and penny-stock performance, and possibly company and analyst data if they are added later.

Visualizations

Our SP500

This is an average of our SP500 prices; you can clearly see the COVID-19 dip in March.

sp500

Comparing our scraped AAPL price to yahoo finance

As you can see below, our data is not perfect as it is only collected once per day, but we have many more features than we know what to do with.

compared_sp500

AAPL difference of features compared to price

Here we difference each feature so that the 2nd day becomes (2nd day) minus (1st day), and so on. We then plot that on the same scale as price to see if there are any indicators of price jumps, and to check the validity of our data.

aapl_difference

About the plot:
  • You can see where AAPL had a stock split when `VolumeAvg90Day` peaked.
  • Features such as `ReturnonEquity` come from quarterly reports, so they only show a difference at the quarter boundaries.
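
The differencing itself is essentially a one-liner in pandas; the features frame below is a toy stand-in for the AAPL feature columns.

import numpy as np
import pandas as pd

# Toy feature frame: one row per day, one column per feature
features = pd.DataFrame(np.random.rand(10, 3),
                        columns=['PE', 'ReturnonEquity', 'VolumeAvg90Day'])

# Day-over-day change; the first row becomes NaN and is dropped
features_diff = features.diff().dropna()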

Positive and negative correlations to price

Pay special attention to the columns in red and green as they are showing negative and positive correlation to price the next day.

positive_negative_price

About the plot: For VLO (Valero), you can see that `PE`, in red, has a negative correlation to price, while `PriceToSales`, in green, has a positive correlation to price.

Lack of overall correlations

Here we took the overall prediction quality of each column and plotted the summed qualities for each symbol. The quality was determined by how well a given feature correlated to changes in all the other features.

lack_of_correlation

About the plot: In red you can see the columns that we dropped. They are overall more normally distributed, and thus not useful in predicting what tomorrow's price is going to be.
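
Assuming "quality" here is read as a sum of absolute pairwise correlations, a rough sketch for one symbol's differenced features could look like this; the toy frame and column names are illustrative, not the exact metric used for the plot.

import numpy as np
import pandas as pd

# Toy differenced feature frame for one symbol
features_diff = pd.DataFrame(np.random.rand(50, 4),
                             columns=['PE', 'PriceToSales', 'Beta', 'Price'])

# Sum of absolute correlations with every other feature, per column
quality = features_diff.corr().abs().sum() - 1  # subtract the self-correlation of 1
low_quality_cols = quality.nsmallest(2).index   # candidates to drop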

First model

Base model predictions

base_pred

About the plot!
  • The network is not doing too well at predicting the test or validation data
  • The drop in quality on the testing data is showing through, as AAPL had a split in September 2020

Manually tuned model predictions

After some slight manual tuning of the network, here are the predictions.

tune_pred

About the plot!
  • Much better than the base model
  • You can see the same split-related drop in quality here on the testing data; maybe we could remove outliers.

Interpretation totals

Here we use shap to get how much each column is affecting the network.

interp

About the plot!
  • AAPL_PE is having the greatest effect on predicting the next day
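
A minimal sketch of how such per-column attributions can be computed with shap for a keras model; the explainer choice (GradientExplainer), the toy model, and the shapes are assumptions, not necessarily the exact code behind these plots.

import numpy as np
import shap
import tensorflow as tf

# Toy data and model standing in for the real windowed features and LSTM
X = np.random.rand(200, 3, 8).astype('float32')   # (samples, days, features)
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(8, input_shape=(3, 8)),
    tf.keras.layers.Dense(1),
])

# Explain a slice of the data against a background sample
explainer = shap.GradientExplainer(model, X[:100])
shap_values = explainer.shap_values(X[:50])

# Mean absolute attribution per input feature gives the "totals" ranking
vals = shap_values[0] if isinstance(shap_values, list) else shap_values
totals = np.abs(vals).mean(axis=(0, 1))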

Interpretation value

Here we use shap to get how each column is affecting the network.

interp

About the plot!
  • AAPL_TotalReturn1Yr going up predicts that price will go up. This agrees with our earlier correlation plot, where AAPL_TotalReturn1Yr correlated positively with price.

Auto-tuned model predictions

Coming soon...

Hyper-parameter tuning methodology

Here we will do a simple walk through of the Hyper-parameter tuning process.

tune.py

Here is a small test for tune.py. You can see that there are many different parameters defined, such as:

  • input_neurons corresponds to how many neurons will be used in the input layer.
  • n_days corresponds to how many days are used for predicting the next, or the length of the TimeSeriesGenerator.
import copy
import pandas as pd

# Import our NetworkTuner
from modeling.tuner import NetworkTuner

if __name__ == "__main__":
    
    # Define parameters to tune
    parameters = {
        'input_neurons': [2, 4, 8, 16],
        'input_dropout_rate': [.1, .3, .5],
        'use_input_regularizer': [0, 1, 2],
        'input_regularizer_penalty': [0.01, 0.05, 0.1, 0.3],
        'n_hidden_layers': [1, 3, 5, 8],
        'hidden_dropout_rate': [0.0, .3, .5, .9],
        'hidden_neurons': [16, 32, 64],
        'use_hidden_regularizer': [0, 1, 2],
        'hidden_regularizer_penalty': [0.01, 0.05, 0.1, 0.3],
        'patience': [5, 25, 50, 100],
        'batch_size': [32, 64, 128],
        'use_early_stopping': [0, 1],
        'n_days': [1, 2, 3]
    }
    
    # Build the test data frame
    _list = list(range(20))
    df = pd.DataFrame({
        'apple': copy.copy(_list),
        'orange': copy.copy(_list),
        'banana': copy.copy(_list),
        'pear': copy.copy(_list),
        'cucumber': copy.copy(_list),
        'tomato': copy.copy(_list),
        'plum': copy.copy(_list),
        'watermelon': copy.copy(_list)
    })
    
    # Define which columns are feature(s) and which are the target(s)
    X_cols = list(df.columns)
    y_cols = 'banana'


    # On the instantiation of NetworkTuner our data is split
    # into k many folds, and then each fold is split again into
    # training, testing, and validation data.

    # Instantiate our NetworkTuner
    nt = NetworkTuner(
        df=df, X_cols=X_cols,
        y_cols=y_cols, k_folds=5, max_n_days=3
    )

    # Call the tune function, passing the parameters defined above
    nt.tune(
        'Albert', max_epochs=100, **parameters
    )

nt.tune

When nt.tune is run, the following function is called from modeling.NetworkTuner:

def tune(self, name, max_epochs=10, **parameters):
    """Running the tuner with kerastuner.Hyperband"""

    # Feeding parameters to tune into the build function
    # before feeding it into the Hyperband
    self.build_and_fit_model = partial(
        self.build_and_fit_model, **parameters
    )

    # Register Logger dir and instantiate kt.Hyperband
    Logger.register_directory(name)
    tuner = kt.Hyperband(self.build_and_fit_model,
                            objective='val_loss',
                            max_epochs=max_epochs,
                            factor=3,
                            directory='./tuner_directory',
                            project_name=name,
                            logger=Logger)

    # Start the search for best hyper-parameters
    tuner.search(self)

    # Get the best hyper-parameters
    best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]

    # Display the best hyper-parameters
    print(f"""The hyperparameter search is complete.
    The optimal number of units in the first densely-connected layer
    {best_hps.__dict__['values']}
    """)

NetworkBuilder

The NetworkBuilder has a series of functions such as the one below for searching the different parameters, getting each selection from the Hyperband. Here is a small cut-out of our input layer, showcasing where the Hyperband makes choices.

# `input_neurons` arrives as the list of candidate values defined in tune.py;
# hp.Choice asks the Hyperband to pick one of them for this trial
input_neurons = hp.Choice('input_neurons', input_neurons)
model.add(LSTM(input_neurons))

tuner.search

As the tuner begins its search, we move to our CustomSequential, which the NetworkTuner uses as its primary network when tuning. The CustomSequential overrides the tensorflow.keras.models.Sequential fit function to implement a cross-validation split. A simplified version of our CustomSequential.fit is defined as follows:

def fit(self, nt, **kwargs):
    """
    Overrides model fit to call it k_folds times
    then averages the loss and val_loss to return back
    as the history.
    """

    histories = []
    h = None

    # Iterate over number of k_folds
    for k in range(1, self.k_folds+1):
        train, test, val = nt.n_day_gens[self.n_days][k]
        # Split data and targets
        X, y = train[0]
        X_t, y_t = test[0]

        # Calling Sequential.fit() with each fold
        h = super(CustomSequential, self).fit(
            X, y,
            validation_data=(X_t, y_t),
            **kwargs)
        histories.append(h.history)

    # Element-wise average of the fold histories (assumes equal epoch counts)
    df = pd.DataFrame(histories)
    h.history['loss'] = np.mean(df['loss'].tolist(), axis=0)
    h.history['val_loss'] = np.mean(df['val_loss'].tolist(), axis=0)
    return h

Conclusion

  • Data can be manipulated in many different ways, and networks can be tuned in many different ways. To accurately predict the stock market, one would have to come across a lucky set of hyper-parameters and a training set that the big players have not already tried on their huge servers. The parameters chosen would also not work forever.

  • Over time, if you were trading in large volumes, the market would become "used" to your predictions, the market movers would start basing their predictions off of yours, and your predictions would become useless.

  • Coming soon...

Next Steps

  • Cluster on absolute correlation, taking the correlation between different symbols.
  • Tune network on which columns are being used for predictions.
  • Tune network with vs without difference data and/or scaling.
  • Forecast tomorrow's prices

Folder Structure

\--- bin
|
\--- db
|
\--- img
|
\--- modeling
\--- \--- tests
\         \--- _python
\         \--- create
\         \--- tuner
|   
\--- old
|
\--- reports
\    \--- aapl_price_w_aapl_info
\    \--- aapl_price_w_all_price
\    \--- aapl_price_w_sector
|
\--- styles
|
\--- test_notebooks

Repository Structure

\--- bin
\    |  __init__.py
\    |  anomoly.py
\    |  database-schema.py
\    |  NN.py
\    |  out.png
\    |  correlation data csv files
|
|
\--- db
\    |  __init__.py
\    |  database.py
\    |  firebase.py
|
|
\--- img
\    |  flow.png
|
|
\--- modeling
\--- \--- tests
\         \--- _python
\              |  test_param_setting.py
\
\         \--- create
\         \--- tuner
\              |  test_cv.py
\              \-  val_folds
\    |  __init__.py
\    |  build.py
\    |  create.py
\    |  sequential.py
\    |  tuner.py
|   
|
\--- old   
\    |  Old main.ipynb
\    |  Old main2.ipynb
\    |  Old model_creation.ipynb
\    |  Old Modeling.ipynb
\    |  Pull and update data.ipynb
\    |  scratch.ipynb
\    |  scratch.py
|
|
\--- reports
\    \--- aapl_price_w_aapl_info
\    \--- aapl_price_w_all_price
\    \--- aapl_price_w_sector
|
|
\--- styles
\   |  custom.css
\   |  
|
|
\--- test_notebooks
\    |  dashboard_test.ipynb
\    |  Firebase Test.ipynb
\    |  model_scratch_testing.ipynb
\    |  Prediction_testing.ipynb
|
|  .gitignore
|  main.ipynb
|  presentation.pdf
|  Pull and clean data.ipynb
|  Readme.ipynb
|  README.md
|  run_tests.py
|  todo.txt
|  tune.py