Closing price prediction of Jakarta Composite Index using Autoencoders and LSTM model

By Zaky Riyadi

In this repo, I'm going to show you how we can use an Autoencoder and an LSTM model to predict the closing price of the Jakarta Composite Index (JKSE) from historical and technical-indicator data. Predicting or forecasting the closing price is a time-series problem, where every observation is time-dependent and the output is continuous. There are many technical indicators that can help us make better decisions on whether to buy or sell a stock. Technical indicators are pattern-based signals calculated from the historical data (Open, Close, High, Low, and Volume) and serve many purposes, ranging from measuring volatility and momentum to trend and volume.

In this study, I'm going to use 37 technical indicators plus 5 historical data features to help predict the closing price of JKSE. Now you may ask,

"Wouldn't that be too many features and may create dimensionality problems?"

and I'd say

"yes! And that's where Autoencoder comes in"

Why are we using Autoencoders?

Using too many features may result in a dimensionality problem, where the model starts to overfit. Overfitting is when the model learns the training dataset too well and fails to generalize to unseen real-world data (the testing dataset). Therefore, in this study, an Autoencoder is used as a feature-extraction method for dimensionality reduction, transforming the data into a lower-dimensional representation. An Autoencoder is an unsupervised artificial neural network that encodes the data by compressing it into a lower dimension (also known as the bottleneck layer/code) and then decodes it to reconstruct the original input. The bottleneck layer holds the compressed representation of the input data, i.e. the data with reduced dimension. Shown in Figure 1 is a diagram of an Autoencoder.

autoencoders

Figure 1: General structure of an Autoencoder. [ref]

Why are we using LSTM model?

Long Short-Term Memory (LSTM) is a Recurrent Neural Network (RNN) architecture commonly used for time-series prediction. The advantage of an RNN is that previous outputs can be used as inputs while hidden states carry information forward, which lets the network remember previous information. This memory makes RNNs well suited to problems where each observation depends on a sequence, such as time series and NLP. However, RNNs have several limitations, including:

  1. Short-term memory: forgetting the earliest information as the sequence moves to later steps
  2. Vanishing gradient: the gradient becomes very small, so the network effectively stops learning.
  3. Exploding gradient: the network assigns unreasonably high importance to the weights.

LSTM addresses these issues by having a cell state (or long-term memory) that runs through the chain with only linear interactions, keeping the information flow largely unchanged. Each cell has a gate mechanism (input, output, and forget gates) that decides whether to keep or discard information. The gates pass information selectively using a sigmoid layer, a hyperbolic tangent layer, and point-wise multiplication operations.

You can read the original publication here to learn more about the API and the math here, or you can read a simpler version here, or, if you just want to watch someone clearly explain LSTM, you can watch here.

LSTM

Figure 2: LSTM cell [ref]

Data preparation and analysis

Now, once we have gone through the concepts and the reasons for using Autoencoders and LSTM, we can finally start the analysis.

1. First, we need to get the dataset. I'm extracting the historical data from Yahoo Finance. We can pull the data directly using Python, as shown in Figure 3 and sketched just below it. I have specified the historical data range from 01/01/2003 (period 1) to 01/09/2022 (period 2). Once we load the dataset, we can convert the CSV to a DataFrame, make the Date column the index, and remove any null values.

load_data

Figure 3: Load the data
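Below is a minimal sketch of this loading step. It assumes the `yfinance` package and the `^JKSE` Yahoo Finance ticker purely for illustration; the original notebook (Figure 3) builds the Yahoo Finance query URL with `period1`/`period2` timestamps, so treat the exact call as an assumption.

```python
# Minimal sketch of the loading step shown in Figure 3.
# Assumption: yfinance and the ^JKSE ticker are used here for illustration;
# the original code pulls the CSV straight from the Yahoo Finance query URL.
import yfinance as yf

# Download daily OHLCV data for the Jakarta Composite Index.
df = yf.download("^JKSE", start="2003-01-01", end="2022-09-01")

# yfinance already returns a Date index; drop any rows with missing values.
df = df.dropna()
print(df.head())
```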

  2. Next, check for any outlier data by plotting all of the features and examining the statistics from a univariate analysis.

Plot between Historical data vs date

Figure 4: Plot between Historical data vs date

  3. Once we have removed all of the outliers, we can start calculating the technical indicators using the Technical Analysis library here. In this study, I'm using 37 technical indicators based on trend, volatility, volume and momentum (a hedged sketch follows Figure 5).

TA calculation

Figure 5: All of the calculated TA
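A minimal sketch of this step, assuming the `ta` library (`pip install ta`) and the `df` DataFrame from the earlier sketch. `add_all_ta_features` appends every indicator the library provides; in the study only 37 trend, volatility, volume and momentum indicators are kept, so the column selection below is illustrative.

```python
# Sketch of the technical-indicator step using the `ta` library.
# add_all_ta_features generates the full set of indicators; the study keeps
# only 37 of them, so treat this as an illustration rather than the exact code.
import ta

df_ta = ta.add_all_ta_features(
    df.copy(),
    open="Open", high="High", low="Low",
    close="Close", volume="Volume",
    fillna=True,
)
print(df_ta.shape)  # original OHLCV columns plus the generated indicators
```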

  4. Now, let's observe the correlation between every feature using Spearman's correlation (see the short snippet after Figure 7). Based on the correlations (Figures 6 & 7), there are many redundant features that are unnecessary and can strongly influence the prediction's accuracy. Therefore, let's finally use the Autoencoder!

rnn_step_forward

Figure 6: Spearman correlation between features

rnn_step_forward

Figure 7: Features Spearman correlation with the target data (Close price)
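For reference, a Spearman correlation matrix like the one in Figures 6 and 7 can be produced directly with pandas (`df_ta` is the indicator DataFrame from the previous sketch):

```python
# Spearman rank correlation between features and against the target.
import seaborn as sns
import matplotlib.pyplot as plt

corr = df_ta.corr(method="spearman")         # feature-to-feature matrix (Figure 6)
target_corr = corr["Close"].sort_values()    # correlation with the Close price (Figure 7)

sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```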

  5. But hold on! Before we feed our features into the Autoencoder, we need to rescale the data from its original range so that all values fall between 0 and 1 using MinMaxScaler(), and we also need to remove the Close price so that the new attributes/features better represent the remaining inputs (a minimal sketch is shown below).
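A minimal sketch of the scaling step, assuming the `df_ta` DataFrame from the earlier sketches:

```python
# Rescale every feature to the [0, 1] range and drop the target column
# before feeding the data to the autoencoder.
from sklearn.preprocessing import MinMaxScaler

features = df_ta.drop(columns=["Close"])     # keep only the input features
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(features)    # shape: (n_samples, n_features)
```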

Autoencoders

  6. Once we know that our data is "clean" (meaning no outliers) and scaled, we can start building our Autoencoder. Shown in Figure 8 is the model summary of the Autoencoder. Here, I'm using two encoder layers and two decoder layers, where the encoder reduces the dimension from 42 (the number of original features) to 20 and then to 10. The bottleneck layer (or code) is the layer we want to extract; it reduces the data to 4 attributes (a hedged Keras sketch follows Figure 8).

The model summary for Autoencoders

Figure 8: The model summary for Autoencoders
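Below is a hedged Keras sketch of an autoencoder with this 42 → 20 → 10 → 4 → 10 → 20 → 42 shape. The layer sizes follow Figure 8; the activations, optimizer, epochs and batch size are assumptions, not the exact settings used in the study.

```python
# Sketch of the autoencoder in Figure 8 (42 -> 20 -> 10 -> 4 -> 10 -> 20 -> 42).
# Activations, optimizer and training settings are assumptions.
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

n_features = X_scaled.shape[1]

inputs = Input(shape=(n_features,))
encoded = Dense(20, activation="relu")(inputs)      # encoder layer 1
encoded = Dense(10, activation="relu")(encoded)     # encoder layer 2
bottleneck = Dense(4, activation="relu")(encoded)   # bottleneck / code
decoded = Dense(10, activation="relu")(bottleneck)  # decoder layer 1
decoded = Dense(20, activation="relu")(decoded)     # decoder layer 2
outputs = Dense(n_features, activation="sigmoid")(decoded)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
history = autoencoder.fit(X_scaled, X_scaled,
                          epochs=100, batch_size=32,
                          validation_split=0.2, shuffle=False)
```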

"But wait", you may asked.

How about Decoder layers? Why are we not using it?

Decoder layers commonly mirror the encoder layers, meaning the input and the output must have the same number of dimensions. It is common practice to use a mirror-like shape (e.g. Encoder 1 = Decoder 2, Encoder 2 = Decoder 1); hence the shapes of the encoder and decoder layers are similar. Additionally, the decoder output is typically used when we want to reconstruct the input (for example, when compressing images) or obtain representative features identical in shape to the input. However, since our objective is to reduce the dimension, we only extract the attributes from the bottleneck layer, which has shrunk the data from 42 features to 4.
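A short sketch of that extraction step, reusing the `inputs`, `bottleneck` and `features` names from the earlier sketches: we build an encoder-only model that shares the trained layers and keep only its 4-dimensional output.

```python
# Keep only the bottleneck output: an encoder-only model sharing the trained
# layers produces the 4 new attributes used for the rest of the study.
import pandas as pd

encoder = Model(inputs, bottleneck)
attributes = pd.DataFrame(
    encoder.predict(X_scaled),
    index=features.index,
    columns=["attribute 1", "attribute 2", "attribute 3", "attribute 4"],
)
```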

  7. Observing the training loss and the validation loss (Figure 9), they show a good fit: both losses decrease and stabilize at a similar epoch (around epoch 26).

Autoencoders loss

Figure 9: Loss vs epoch

Now let's evaluate the new attributes. I have renamed the newly generated attributes to attributes 1 to 4. Shown in Figure 10 is the Spearman correlation. The result demonstrates that attributes 3 and 4 correlate very well with the closing price, and by plotting the line graph between Close, attribute 3 and attribute 4, we can see the monotonic trend between them (Figure 11). Since attributes 1 and 2 do not represent the features very well, let's remove them and only use attributes 3 and 4 to predict the closing price.

The model summary for Autoencoders

Figure 10: Spearman correlation between attribute 1–4 to Closing price

Line plot

Figure 11: Line plot between Close, attribute 3 and attribute 4

LSTM

  8. Now, let's finally predict the closing price using LSTM. First, I split the dataset into training and test sets based on the date: the training set runs from 23/10/2003 to 01/05/2021, while the test set runs from 01/05/2021 to 01/09/2022. Next, we scale the features and the target data using MinMaxScaler() (a sketch of the split and scaling follows Figure 12).

Line plot

Figure 12: plot between the training and testing dataset
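A hedged sketch of the date-based split and scaling, assuming `attributes` and `df_ta` from the earlier sketches and a date index. Fitting the scalers on the training set only is an assumption about the implementation.

```python
# Date-based train/test split followed by MinMax scaling of features and target.
from sklearn.preprocessing import MinMaxScaler

data = attributes[["attribute 3", "attribute 4"]].join(df_ta["Close"])

train = data.loc["2003-10-23":"2021-05-01"]
test = data.loc["2021-05-01":"2022-09-01"]

feature_scaler = MinMaxScaler().fit(train[["attribute 3", "attribute 4"]])
target_scaler = MinMaxScaler().fit(train[["Close"]])

X_train = feature_scaler.transform(train[["attribute 3", "attribute 4"]])
y_train = target_scaler.transform(train[["Close"]])
X_test = feature_scaler.transform(test[["attribute 3", "attribute 4"]])
y_test = target_scaler.transform(test[["Close"]])
```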

  9. Now let's develop the LSTM model. LSTM requires three-dimensional input: batch size, timesteps and features. In this study, I'm using TensorFlow's time-series generator module, where we need to specify the window length, sampling rate and batch size. You can read the documentation here (a hedged sketch follows the hyper-parameter list below).

  10. Shown in Figure 13 is the model summary of the LSTM model. I'm using two LSTM layers with 400 and 350 neurons, respectively (see the sketch after this list). Other hyper-parameters include:

  • Optimizer = Adam
  • Learning rate = 0.001 (with a ReduceLROnPlateau callback)
  • Epochs = 1000 (with an EarlyStopping callback)
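Below is a hedged sketch of the generator and the LSTM model. The layer sizes, optimizer and callbacks follow the list above; the window length, sampling rate, batch size and callback patience values are placeholder assumptions.

```python
# Sketch of the TimeseriesGenerator and the two-layer LSTM described above.
# window_length, sampling_rate, batch_size and patience values are placeholders.
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

window_length = 20  # assumption: number of past days fed to the model
train_gen = TimeseriesGenerator(X_train, y_train,
                                length=window_length,
                                sampling_rate=1, batch_size=32)
test_gen = TimeseriesGenerator(X_test, y_test,
                               length=window_length,
                               sampling_rate=1, batch_size=32)

model = Sequential([
    LSTM(400, return_sequences=True,
         input_shape=(window_length, X_train.shape[1])),
    LSTM(350),
    Dense(1),
])
model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")

callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
    EarlyStopping(monitor="val_loss", patience=20, restore_best_weights=True),
]
history = model.fit(train_gen, validation_data=test_gen,
                    epochs=1000, callbacks=callbacks)
```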

Line plot

Figure 13: Model summary of the LSTM model

Now, observing the loss vs. epoch plot, we can see that the validation loss initially indicated overfitting, but in the later epochs it converged with the training loss, at around the 25th epoch.

Line plot

Figure 14: train loss vs Val loss

Let's observe the prediction's accuracy. Shown in Figure 15 is the line plot between the predicted and actual closing price. We can see that the predicted close price follows the actual closing price closely. Looking at the accuracy scores, we get a decent result: an R^2 of 0.97, an MAE of 52.8 and an RMSE of 71.

true_close_vs_close_pred

Figure 15: Comparing the true close vs. close prediction at test dataset
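For completeness, a sketch of how these scores can be computed on the inverse-scaled predictions, reusing `model`, `test_gen`, `target_scaler`, `y_test` and `window_length` from the earlier snippets (the target alignment assumes the generator's default behaviour of predicting the value right after each window):

```python
# Compute R^2, MAE and RMSE on predictions mapped back to the original price scale.
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_pred = target_scaler.inverse_transform(model.predict(test_gen))
y_true = target_scaler.inverse_transform(y_test[window_length:])

print("R^2 :", r2_score(y_true, y_pred))
print("MAE :", mean_absolute_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
```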

So that's it! I really do hope that you enjoyed and learnt something from this repo! I'm trying to make more repos like this and also YouTube videos. So stay tuned!

References

https://www.analyticsvidhya.com/blog/2021/06/dimensionality-reduction-using-autoencoders-in-python/

https://www.sciencedirect.com/science/article/pii/S2666827022000378

https://medium.datadriveninvestor.com/a-high-level-introduction-to-lstms-34f81bfa262d

https://medium.com/@kangeugine/long-short-term-memory-lstm-concept-cb3283934359