A hybrid forecasting model combining LSTM for sequence prediction and ARIMA for error correction. This repo demonstrates improved accuracy in financial trend prediction, showcasing training processes, error analysis, and performance metrics.

Primary Language: Python

Using LSTM and ARIMA Models for Stock Forecasting

The LSTM model serves as the primary forecasting tool, leveraging its ability to capture long-term dependencies in sequential data. However, recognizing that even sophisticated models like LSTM can have prediction biases, an ARIMA model is employed to estimate and correct these errors. By doing so, the system harnesses the strengths of both models: LSTM's deep learning capabilities for handling complex patterns and ARIMA's effectiveness in modeling the linear structure that remains in the prediction errors.

The repository includes a detailed script that outlines the entire process, from data loading and preprocessing to model training and evaluation. The data_loader function sets the stage, preparing the dataset for analysis. It's followed by a series of plotting functions that visualize various aspects of the data, such as raw time series, training versus testing sets, and prediction errors.

The LSTM model's architecture is defined with several layers, including LSTM and Dense layers, and the model is trained using the historical closing prices of financial assets. After training, the model's predictions are plotted against the actual values to visualize the performance.

The ARIMA model is then fitted to the errors of the LSTM's predictions so that it can estimate them. These error estimates are subsequently used to adjust the LSTM predictions, resulting in a final, corrected output. This final prediction is expected to be more accurate and is visualized alongside the actual data for evaluation.

Performance metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) are calculated to quantify the accuracy of the models. The repository captures these metrics in a structured format, allowing for clear interpretation of the model's effectiveness.

Data

Time-series data is a sequential collection of data points recorded at specific time intervals. In financial markets, time-series data primarily consists of stock prices, trading volumes, and various financial indicators gathered at regular time intervals. The significance of time-series data lies in its chronological order, a fundamental aspect that enables the identification of trends, cycles, and patterns critical for forecasting.

Long Short-Term Memory (LSTM):

Sequential Data Analysis: A Leap Forward with LSTM Networks

Introduction :

The inception of Long Short-Term Memory (LSTM) networks marked a pivotal advancement in the field of sequential data analysis. These networks, a specialised evolution of Recurrent Neural Network (RNN) architectures, emerged to address the challenge of preserving information over extended sequences, a hurdle where traditional RNNs faltered due to the vanishing gradient problem. LSTMs were crafted to retain critical data across long intervals, ensuring that pivotal past information influences future decisions.

Decoding LSTM Mechanisms :

In my program, I have utilised TensorFlow to construct and train an LSTM-based model for time series forecasting. Let's break down how each of the LSTM components corresponds to my program:

Memory Cell (LSTM Cell) :

In my program, the memory cell is represented implicitly by the LSTM layer I've added using tf.keras.layers.LSTM(number_nodes, input_shape=(n, 1)). This LSTM layer acts as the memory cell of the network. The memory cell's purpose in my program is to capture and retain information over extended sequences. It is responsible for learning and remembering patterns and dependencies in the input time series data (middle_data) over time.

Input Gate:

The input gate is a crucial part of an LSTM unit that regulates what information should be added to the memory cell. It uses a sigmoid function to control how much of the new input is admitted and employs a hyperbolic tangent (tanh) function to create a vector of candidate values ranging from -1 to +1.

In my program, the input gate is implicitly implemented by the LSTM layer (tf.keras.layers.LSTM) within TensorFlow. The LSTM layer manages the flow of input information, determines what information should be stored in its cell state, and applies appropriate weightings using sigmoid and tanh functions.

Forget Gate:

The forget gate is responsible for deciding which information in the memory cell should be discarded. It employs a sigmoid function to assess the importance of each piece of information in the current memory state. In my program, the forget gate's functionality is automatically handled by the LSTM layer. It learns to decide which information from the previous memory state should be forgotten or retained based on the patterns and dependencies it identifies in the input data.

Output Gate:

The output gate extracts valuable information from the memory cell to produce the final output. It applies a sigmoid function to the current input and the previous hidden state to decide which parts of the memory cell to expose, and multiplies the result by the tanh of the cell state to regulate and filter the information before presenting it as the output. In my program, the output gate's operations are also encapsulated within the LSTM layer. It takes the current memory state and the input data to produce an output that is used for making predictions.
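
To make the gate descriptions above concrete, here is a minimal NumPy sketch of a single LSTM time step. It is an illustration only and not how the TensorFlow layer is implemented internally; the function name lstm_cell_step, the weight names W, U, b, and the gate ordering inside the stacked weights are assumptions chosen for this example.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    """Run one LSTM time step.

    x_t: input at this step, shape (D,)
    h_prev, c_prev: previous hidden and cell state, shape (H,)
    W: (4H, D), U: (4H, H), b: (4H,) stacked as [input, forget, candidate, output]
    (the stacking order is an assumption of this sketch).
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b

    i = sigmoid(z[0:H])            # input gate: how much new information to admit
    f = sigmoid(z[H:2 * H])        # forget gate: how much old memory to keep
    g = np.tanh(z[2 * H:3 * H])    # candidate values in (-1, 1)
    o = sigmoid(z[3 * H:4 * H])    # output gate: how much memory to expose

    c_t = f * c_prev + i * g       # updated memory cell
    h_t = o * np.tanh(c_t)         # output / hidden state
    return h_t, c_t
```

In the program itself none of this is written by hand: tf.keras.layers.LSTM learns the weights and applies the step once per element of each input window.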

ARIMA:

The Linear Approach to Time-Series Forecasting

Introduction to the ARIMA Model:

The Autoregressive Integrated Moving Average (ARIMA) model stands as a fundamental pillar within the realm of statistical time-series analysis. Its inception by Box and Jenkins in the early 1970s brought forth a powerful framework that amalgamates autoregressive (AR) and moving average (MA) elements, all while incorporating differencing to stabilise the time-series (the "I" in ARIMA). ARIMA models are celebrated for their simplicity and efficacy in modelling an extensive array of time-series data, notably for their proficiency in capturing linear relationships.

  • Error Mining with ARIMA : After LSTM's predictions, the program calls on ARIMA to refine these forecasts. The Error_Evaluation function comes into play here, extracting the difference between the predicted and actual prices, which captures the LSTM's predictive shortcomings.

  • ARIMA's Calibration : With the error data in hand, the ARIMA_Model function is invoked, wielding the ARIMA model as a fine brush to paint over the imperfections of the LSTM's initial output. The ARIMA model is trained on these residuals, learning the patterns in the LSTM's prediction errors so that it can forecast them.

  • Synthesis of Predictions : The Final_Predictions function represents the culmination of the program's operations. It does not merely output raw predictions but synthesises the LSTM's foresight with ARIMA's insights, producing a final prediction that encapsulates the strengths of both models.
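
The three steps above can be sketched as follows. This is a minimal illustration and not the repository's code: the signatures of Error_Evaluation and Final_Predictions are assumed, the error is assumed to be defined as actual minus predicted, and a least-squares AR(1) fit (the same model class as ARIMA(1,0,0)) stands in for the ARIMA model so that the example needs only NumPy. The helper names fit_ar1 and forecast_errors are hypothetical.

```python
import numpy as np


def Error_Evaluation(actual, predicted):
    """Residuals of the LSTM (assumed sign convention: actual - predicted)."""
    return np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)


def fit_ar1(errors):
    """Least-squares fit of e[t] = c + phi * e[t-1]; a stand-in for ARIMA."""
    e = np.asarray(errors, dtype=float)
    X = np.column_stack([np.ones(len(e) - 1), e[:-1]])
    c, phi = np.linalg.lstsq(X, e[1:], rcond=None)[0]
    return c, phi


def forecast_errors(errors, steps):
    """Iterate the fitted AR(1) recursion to forecast the next `steps` errors."""
    c, phi = fit_ar1(errors)
    last = float(np.asarray(errors, dtype=float)[-1])
    forecast = []
    for _ in range(steps):
        last = c + phi * last
        forecast.append(last)
    return np.array(forecast)


def Final_Predictions(lstm_predictions, error_forecast):
    """Corrected forecast: LSTM output plus the forecast of its own error."""
    return (np.asarray(lstm_predictions, dtype=float)
            + np.asarray(error_forecast, dtype=float))
```

With the opposite sign convention (predicted minus actual) the correction would be subtracted instead of added, so the two functions must agree on the definition of the error.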

Integrating LSTM and ARIMA

The integration of LSTM and ARIMA models presents a compelling hybrid approach to time-series forecasting. This methodology draws on the strengths of both models: LSTMs are capable of capturing complex non-linear patterns, while ARIMA excels at modelling the linear aspects of a time-series. By combining these two, one can potentially mitigate their individual weaknesses and enhance the overall predictive power.

Analysis of Combined Model Predictions :

Integrating LSTM and ARIMA is intended to make the model more robust against the volatility and unpredictability of financial time-series data. The predictions from the LSTM are refined by the ARIMA model's error correction mechanism, which adds another layer of sophistication to the forecasts.

Comparative Analysis: LSTM vs. LSTM+ARIMA vs. Actual Values :

Comparing the predictions from the LSTM, the hybrid LSTM+ARIMA model, and the actual values, several insights emerge. The LSTM model may capture the momentum and direction of stock prices effectively, but it might struggle with precision due to its sensitivity to recent data. The ARIMA model, conversely, may lag in capturing sudden market shifts but provides a smoothed forecast that averages out noise.

The hybrid model aims to balance these aspects. The LSTM component may anticipate a trend based on recent patterns, and the ARIMA part can adjust this forecast by considering the broader historical context. The final predictions, ideally, are more aligned with the actual values than either model could achieve on its own.

Implementation of the Program :

Function Definition and Working

data_loader()

Purpose: The data_loader function is designed to load financial time-series data from a CSV file and prepare it as a DataFrame formatted for time series analysis.

Input: The function takes no parameters but relies on a globally defined Filename_address variable that contains the path to the CSV file.

Processing Elements:

  1. Pandas Library: Utilized for its powerful data manipulation capabilities, particularly for reading CSV files and handling time series data.
  2. Global Variables: It uses the Filename_address to locate the CSV file.
  3. DataFrame Operations:
    • pd.read_csv: Reads the CSV file into a DataFrame, with the 'Date' column set as the index and parsed as datetime objects for time series analysis.
    • dropna: Removes any rows with missing values to ensure the integrity of the time series data.

Output: The function returns a DataFrame object containing the clean, time-indexed financial data.

Pseudo Code Algorithm

Function data_loader
    Define column names as ["Open", "High", "Low", "Close", "Adj_Close", "Volume"]
    Load CSV file from 'Filename_address' into a DataFrame with 'Date' as index
    Set DataFrame columns to the defined column names
    Drop any rows with missing values
    Print the shape of the DataFrame
    Print the first few rows of the DataFrame
    Return the cleaned DataFrame
EndFunction

Flow of the Program for data_loader()

  1. Initialize the column names for the financial data.
  2. Use the Pandas function read_csv to read the data from the CSV file specified by the Filename_address.
  3. Set the index of the DataFrame to the 'Date' column, which is parsed as datetime.
  4. Assign the predefined column names to the DataFrame to maintain consistency.
  5. Remove any rows with missing data to ensure the data quality for subsequent analysis.
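
A minimal pandas sketch of the steps above. The value of Filename_address is a placeholder, and the optional source argument is an assumption added so the sketch can be tried on in-memory data; the function described here takes no parameters.

```python
import pandas as pd

Filename_address = "prices.csv"  # placeholder path; the program defines its own


def data_loader(source=None):
    """Load the price CSV, index it by date, and drop incomplete rows."""
    columns = ["Open", "High", "Low", "Close", "Adj_Close", "Volume"]

    # 'Date' becomes the index and is parsed into datetime values.
    data = pd.read_csv(
        Filename_address if source is None else source,
        index_col="Date",
        parse_dates=True,
    )

    data.columns = columns   # enforce consistent column names
    data = data.dropna()     # remove rows with missing values

    print(data.shape)
    print(data.head())
    return data
```

Assigning data.columns assumes the CSV has exactly six value columns in this order; a file with a different layout would raise an error at that line.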

Function Definition and Working

plot_predictions(train, predictions, title)

Purpose: The plot_predictions function is designed to visualize the actual vs. predicted financial time-series data. It generates a plot that overlays the predicted values over the actual values, allowing for a visual comparison.

Input:

  • train: A pandas Series or DataFrame containing the actual values indexed by date.
  • predictions: A pandas Series or DataFrame containing the predicted values, expected to be of the same length and with the same index as train.
  • title: A string representing the title of the plot, which will also be used in naming the saved plot file.

Processing Elements:

  1. Matplotlib Library: Used for creating visualizations.
  2. Global Variables: Utilizes Output_address to determine the save path for the plot image.

Output:

  • The function saves a .jpg image file of the plot to the location specified by Output_address with the given title as its name.
  • No value is returned by the function.

Pseudo Code Algorithm

Function plot_predictions with parameters: train, predictions, title
    Initialize a new figure with specified dimensions (10x5 inches)
    Plot the 'train' data with the index on the x-axis and values on the y-axis, labeled as 'Actual'
    Plot the 'predictions' data on the same axes, labeled as 'Predicted' in red color
    Set the title of the plot
    Set the x-axis label as 'Date'
    Set the y-axis label as 'Close-Price'
    Concatenate the `Output_address` with the `title` and ".jpg" to form the file path
    Save the figure to the file path
EndFunction

Flow of the Program for plot_predictions()

  1. Start by creating a new figure with the defined size.
  2. Plot the actual values (train) against their date index, labeling this line as 'Actual'.
  3. Plot the predicted values (predictions) on the same plot, using a different color and labeling it 'Predicted'.
  4. Assign the provided title to the plot.
  5. Label the x-axis as 'Date' and the y-axis as 'Close-Price' to indicate what the axes represent.
  6. Combine the Output_address directory path with the title of the plot to create the full file path for saving.
  7. Save the figure as a .jpg file at the determined file path.
  8. The plot is now saved to the local file system, and the function terminates without returning any value.
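
The flow above could look roughly like this in Matplotlib. It is a sketch under assumptions: the value of Output_address is a placeholder, the output_address keyword and the legend call are additions for this example, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

Output_address = "./"  # placeholder; must end with a path separator


def plot_predictions(train, predictions, title, output_address=None):
    """Overlay predicted values on the actual series and save the plot as .jpg."""
    base = Output_address if output_address is None else output_address

    fig = plt.figure(figsize=(10, 5))
    plt.plot(train.index, train, label="Actual")
    plt.plot(train.index, predictions, label="Predicted", color="red")
    plt.title(title)
    plt.xlabel("Date")
    plt.ylabel("Close-Price")
    plt.legend()  # added in this sketch so the two labels are visible

    # The path is formed by plain concatenation, as in the pseudo code.
    plt.savefig(base + title + ".jpg")
    plt.close(fig)
```

Because the path is built by string concatenation, an Output_address without a trailing separator would silently become part of the file name.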

Function Definition and Working

plot_train_test(train, test)

Purpose:
The plot_train_test function generates a plot to visualize the partition of financial time-series data into training and testing sets. This visual aid is important to verify the partitioning and observe the continuity and potential discrepancies between the train and test sets.

Input:

  • train: A pandas Series or DataFrame containing the training set data, indexed by date.
  • test: A pandas Series or DataFrame containing the testing set data, indexed by date.

Processing Elements:

  1. Matplotlib Library: Used for creating and saving the plot.
  2. Global Variables: The function uses Output_address for determining where to save the output image.

Output:

  • The function outputs a plot saved as a .jpg file to the location specified by Output_address. The plot displays the training and testing data series.

Pseudo Code Algorithm

Function plot_train_test with parameters: train, test
    Initialize a new figure with a size of 10x5 inches
    Plot the 'train' series against its index with a label 'Train Set'
    Plot the 'test' series against its index with a label 'Test Set' and set the color to orange
    Set the title of the plot to 'Train and Test Data'
    Set the x-axis label to 'Date'
    Set the y-axis label to 'Close Price'
    Concatenate `Output_address` with the filename ' Train and Test Data .jpg'
    Save the figure to the specified address
EndFunction

Flow of the Program for plot_train_test()

  1. Begin by initiating a new figure for plotting with specified dimensions (10x5 inches).
  2. Plot the training dataset (train) on the figure, with dates on the x-axis and training data values on the y-axis, labeling it as 'Train Set'.
  3. Plot the testing dataset (test) on the same figure, with dates on the x-axis and testing data values on the y-axis, labeling it as 'Test Set' and using a distinct orange color for differentiation.
  4. Title the plot 'Train and Test Data' to describe the plotted data.
  5. Label the x-axis as 'Date' to indicate the time component and the y-axis as 'Close Price' to denote the financial metric plotted.
  6. Construct the file path for saving the plot by combining Output_address with the designated file name ' Train and Test Data .jpg'.
  7. Save the plot to the constructed file path.
  8. The function concludes after saving the plot, and it does not return any values.

Function Definition and Working

plot_prediction_errors(errors)

Purpose:
The plot_prediction_errors function is used to visualize the errors over time between actual and predicted values in a time series forecasting model. This can help in identifying patterns or biases in the prediction errors.

Input:

  • errors: A list or pandas Series containing the prediction errors, typically calculated as the difference between actual and predicted values.

Processing Elements:

  1. Matplotlib Library: This function utilizes Matplotlib to create and save a visualization plot of the prediction errors.
  2. Global Variables: Output_address is used to determine where the plot image will be saved.

Output:

  • The function saves a .jpg file of the error plot to the directory specified by Output_address.

Pseudo Code Algorithm

Function plot_prediction_errors with parameter: errors
    Initialize a new figure with a size of 10x5 inches
    Plot 'errors' with labeling as 'Prediction Errors'
    Set the title of the plot to 'Prediction Errors over Time'
    Set the x-axis label to 'Time Step'
    Set the y-axis label to 'Error'
    Create a legend for the plot
    Form the save address by concatenating `Output_address` with ' Prediction Errors over Time .jpg'
    Save the figure to the address
EndFunction

Flow of the Program for plot_prediction_errors()

  1. Initiate a new figure with the specified dimensions for the plot.
  2. Plot the errors provided by the errors parameter against their corresponding time step.
  3. Title the plot 'Prediction Errors over Time' to accurately reflect the data being visualized.
  4. Label the x-axis as 'Time Step' to represent the sequential nature of the data points.
  5. Label the y-axis as 'Error' to represent the magnitude of the prediction errors.
  6. Add a legend to the plot for clarity, which describes the data series plotted.
  7. Construct the full file path where the plot will be saved by appending ' Prediction Errors over Time .jpg' to the Output_address.
  8. Save the plot to the specified file path.
  9. The function completes its execution after the plot is saved, without returning any values.

Function Definition and Working

plot_final_predictions(test, final_predictions)

Purpose:
plot_final_predictions is designed to create a visualization comparing the actual values from the test dataset with the final corrected predictions. This helps to assess the accuracy and effectiveness of the error correction applied to the predictive model.

Input:

  • test: A pandas Series or DataFrame containing the test set data, indexed by date.
  • final_predictions: A pandas Series or DataFrame of the same length and with the same index as test containing the final predictions after error correction.

Processing Elements:

  1. Matplotlib Library: It is utilized for plotting and saving the comparison plot.
  2. Global Variables: The function requires Output_address to define the path where the plot image will be saved.

Output:

  • The function outputs a plot saved as a .jpg file to the location determined by Output_address. The plot displays the actual values and the corrected predictions.

Pseudo Code Algorithm

Function plot_final_predictions with parameters: test, final_predictions
    Initialize a new figure with a size of 10x5 inches
    Plot the 'test' series against its index with a label 'Actual'
    Plot the 'final_predictions' series against the same index with a label 'Corrected Prediction' in green color
    Set the title of the plot to 'Final Predictions with Error Correction'
    Set the x-axis label to 'Date'
    Set the y-axis label to 'Close Price'
    Create a legend for the plot
    Form the save address by concatenating `Output_address` with the file name ' Final Predictions with Error Correction .jpg'
    Save the figure to the constructed address
EndFunction

Flow of the Program for plot_final_predictions()

  1. Begin by initiating a new plotting figure with the given dimensions.
  2. Plot the actual test data (test) with the date index on the x-axis and close prices on the y-axis, labeled as 'Actual'.
  3. Plot the final corrected predictions (final_predictions) on the same axes, labeling it as 'Corrected Prediction' and using green color for distinction.
  4. Title the plot 'Final Predictions with Error Correction' to describe its purpose.
  5. Label the x-axis 'Date' and the y-axis 'Close Price' to indicate what the plot represents.
  6. Add a legend to the plot to identify the data series.
  7. Construct the file path for saving the plot by combining Output_address with the file name ' Final Predictions with Error Correction .jpg'.
  8. Save the plot to the determined file path.
  9. The function concludes after saving the plot, and it does not return any value.

Function Definition and Working

plot_accuracy(mse, rmse, mae)

Purpose:
The plot_accuracy function generates a bar chart to visually represent the accuracy metrics of a predictive model. These metrics typically include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).

Input:

  • mse: A numerical value representing the Mean Squared Error.
  • rmse: A numerical value representing the Root Mean Squared Error.
  • mae: A numerical value representing the Mean Absolute Error.

Processing Elements:

  1. Matplotlib Library: Used for plotting and saving the accuracy metrics as a bar chart.
  2. Global Variables: The function uses Output_address to determine the directory path where the plot image will be saved.

Output:

  • The function outputs a bar chart saved as a .jpg file to the directory specified by Output_address.

Pseudo Code Algorithm

Function plot_accuracy with parameters: mse, rmse, mae
    Define a list 'metrics' with the values 'MSE', 'RMSE', 'MAE'
    Define a list 'values' with the input parameters mse, rmse, mae
    Initialize a new figure with a size of 10x5 inches
    Plot a bar chart with 'metrics' as the x-axis and 'values' as the heights of the bars
    Assign different colors to each bar for distinction
    Set the title of the plot to 'Model Accuracy Metrics'
    Form the save address by concatenating `Output_address` with the file name ' Model Accuracy Metrics .jpg'
    Save the figure to the specified address
EndFunction

Flow of the Program for plot_accuracy()

  1. Define the names of the metrics to be plotted (MSE, RMSE, MAE) in a list.
  2. Gather the provided accuracy metric values into a list corresponding to the metric names.
  3. Initialize a new plotting figure with predetermined dimensions (10x5 inches).
  4. Create a bar chart with the metric names on the x-axis and their corresponding values as the heights of the bars, with each bar colored differently for easy distinction.
  5. Title the plot 'Model Accuracy Metrics' to clearly indicate what the chart represents.
  6. Determine the file path for saving the plot by appending ' Model Accuracy Metrics .jpg' to the Output_address.
  7. Save the bar chart to the constructed file path.
  8. The function ends after the bar chart is saved and does not return any values.
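
A possible Matplotlib version of the bar chart described above. The specific colours, the placeholder Output_address, and the output_address keyword are assumptions of this sketch; the file name, including its spaces, follows the pseudo code.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

Output_address = "./"  # placeholder; must end with a path separator


def plot_accuracy(mse, rmse, mae, output_address=None):
    """Draw MSE, RMSE and MAE as three bars and save the chart as .jpg."""
    base = Output_address if output_address is None else output_address

    metrics = ["MSE", "RMSE", "MAE"]
    values = [mse, rmse, mae]

    fig = plt.figure(figsize=(10, 5))
    # One colour per bar; the exact colours are an assumption.
    plt.bar(metrics, values, color=["blue", "orange", "green"])
    plt.title("Model Accuracy Metrics")

    plt.savefig(base + " Model Accuracy Metrics .jpg")
    plt.close(fig)
```

Since MSE is in squared units while RMSE and MAE are in price units, the MSE bar can dwarf the other two on a shared axis.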

Function Definition and Working

plot_arima_accuracy(mse, rmse, mae)

Purpose:
The plot_arima_accuracy function visualizes the accuracy metrics specific to an ARIMA model using a bar chart. This visualization assists in the evaluation of the model's performance by representing Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) as bar heights.

Input:

  • mse: A numeric value indicating the Mean Squared Error.
  • rmse: A numeric value indicating the Root Mean Squared Error.
  • mae: A numeric value indicating the Mean Absolute Error.

Processing Elements:

  1. Matplotlib Library: Employs matplotlib to create and save a bar chart.
  2. Global Variables: The function utilizes Output_address for the path where the bar chart will be saved.

Output:

  • This function outputs a bar chart saved as a .jpg file in the directory specified by Output_address.

Pseudo Code Algorithm

Function plot_arima_accuracy with parameters: mse, rmse, mae
    Define a list 'metrics' with elements 'MSE', 'RMSE', 'MAE'
    Define a list 'values' with the input parameters mse, rmse, mae
    Initialize a new figure with dimensions of 10 by 5 inches
    Create a bar chart with 'metrics' on the x-axis and 'values' as the bar heights
    Assign specific colors to each bar (blue for MSE, orange for RMSE, green for MAE)
    Set the chart title to 'ARIMA Model Accuracy Metrics'
    Determine the save address by concatenating `Output_address` with ' Model Accuracy Metrics .jpg'
    Save the figure to the defined address
EndFunction

Flow of the Program for plot_arima_accuracy()

  1. Initialize a list called metrics with the names of the accuracy metrics to be displayed.
  2. Create a list called values containing the values of MSE, RMSE, and MAE passed to the function.
  3. Begin a new plot with a figure size set to 10x5 inches.
  4. Plot a bar chart where the x-axis contains the metric names from metrics and the y-axis corresponds to their respective values from values.
  5. Assign a distinct color to each bar to visually differentiate between the metrics.
  6. Title the plot 'ARIMA Model Accuracy Metrics' to clearly convey the plot's focus.
  7. Formulate the full file path for saving the chart by appending ' Model Accuracy Metrics .jpg' to the Output_address.
  8. Save the bar chart to the file path that was created.
  9. The function terminates after the plot is saved, without returning any value.

Function Definition and Working

data_allocation(data)

Purpose:
The data_allocation function is tasked with partitioning a given dataset into training and testing sets for model development and evaluation. This split is essential for assessing the model's performance on unseen data.

Input:

  • data: A pandas DataFrame that contains the time series data with one of the columns being close, representing the closing price which is typically used in financial time series forecasting.

Processing Elements:

  1. Global Variables:
    • days: The number of entries from the end of the dataset to be allocated to the test set.
    • close: A string that denotes the column name for the closing prices in the data DataFrame.

Output:

  • train: A pandas Series or DataFrame containing the training set data.
  • test: A pandas Series or DataFrame containing the testing set data.

Pseudo Code Algorithm

Function data_allocation with parameter: data
    Calculate train_len_val by subtracting the number of days (global variable) from the length of the data
    Split the 'data' into 'train' and 'test' sets by slicing:
        'train' contains all entries from start up to train_len_val
        'test' contains all entries from train_len_val to the end
    Print the training set and its size
    Print the testing set and its size
    Return the 'train' and 'test' sets
EndFunction

Flow of the Program for data_allocation()

  1. Determine the length of the training set by subtracting the global variable days from the total length of the dataset.
  2. Allocate the first segment of the dataset up to the determined length to the training set.
  3. Allocate the remaining segment from the determined length to the end of the dataset to the testing set.
  4. Print a descriptive message followed by the training set and its size to provide an immediate visual confirmation of the data partitioning.
  5. Print a descriptive message followed by the testing set and its size for the same reasons as above.
  6. Return both the training set and the testing set to be used in subsequent stages of the model development and evaluation process.
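
A short sketch of this chronological split. The values given to the globals days and close are placeholders, and selecting the closing-price column inside the function is an assumption; the pseudo code does not say where that selection happens.

```python
import pandas as pd

days = 3         # placeholder: number of final rows held out for testing
close = "Close"  # placeholder: name of the closing-price column


def data_allocation(data):
    """Split the closing prices into a training part and a final test part."""
    train_len_val = len(data) - days

    # No shuffling: the test set is always the most recent `days` rows.
    train = data[close].iloc[:train_len_val]
    test = data[close].iloc[train_len_val:]

    print("Training set:", train, "size:", len(train))
    print("Testing set:", test, "size:", len(test))
    return train, test
```

Keeping the split chronological matters for time series: shuffling before splitting would let the model train on values that lie after the ones it is tested on.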

Function Definition and Working

apply_transform(data, n)

Purpose:
The apply_transform function is designed to transform time series data into a format suitable for training LSTM (Long Short-Term Memory) networks. The transformation involves creating sequences of n previous data points (lags) to predict the next value.

Input:

  • data: A pandas Series or numpy array containing the time series data.
  • n: An integer that defines the number of lags, i.e., the size of the input sequence for the LSTM model.

Processing Elements:

  1. NumPy Library: Used for numerical operations and to transform the list of sequences into a numpy array suitable for the LSTM input.
  2. List Comprehension: Constructs the sequences of lags (input data) and the target values (what the model will learn to predict).

Output:

  • middle_data: A numpy array of shape (number of sequences, n, 1), where each sequence is a sliding window of n lagged values from the data.
  • target_data: A numpy array containing the target values corresponding to each sequence in middle_data.

Pseudo Code Algorithm

Function apply_transform with parameters: data, n
    Initialize an empty list called 'middle_data'
    Initialize an empty list called 'target_data'
    Loop over the data starting from index n to the end of the data:
        Extract the sequence of 'n' values from 'data' immediately before the current index (the current value itself is excluded)
        Append the sequence to 'middle_data'
        Append the current value of 'data' to 'target_data'
    Convert 'middle_data' into a numpy array and reshape it to (len(middle_data), n, 1)
    Convert 'target_data' into a numpy array
    Return 'middle_data' and 'target_data'
EndFunction

Flow of the Program for apply_transform()

  1. Initialize two empty lists: middle_data for storing the input sequences and target_data for the corresponding target values.
  2. Iterate over the data series starting from the nth element to the end.
  3. For each iteration, extract a sequence of n values from the data series leading up to the current index and append this sequence to middle_data.
  4. Append the value at the current index of the data series to target_data as the target value for the previously extracted sequence.
  5. After the loop, convert middle_data into a numpy array and reshape it to have the dimensions suitable for LSTM input, which is (number of sequences, n, 1).
  6. Convert target_data into a numpy array without reshaping since it represents the target values.
  7. Return the middle_data and target_data arrays for use in training the LSTM model.
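
The windowing can be sketched in NumPy as below. This follows the pseudo code but is not necessarily identical to the repository's implementation; the conversion to float is an assumption.

```python
import numpy as np


def apply_transform(data, n):
    """Turn a 1-D series into (samples, n, 1) windows and next-step targets."""
    values = np.asarray(data, dtype=float)

    middle_data = []
    target_data = []
    for i in range(n, len(values)):
        middle_data.append(values[i - n:i])  # the n values before index i
        target_data.append(values[i])        # the value the model should predict

    middle_data = np.array(middle_data).reshape(len(middle_data), n, 1)
    target_data = np.array(target_data)
    return middle_data, target_data


# Small usage example: six points with n = 3 give three windows.
X, y = apply_transform([1, 2, 3, 4, 5, 6], 3)
print(X.shape)  # (3, 3, 1)
print(y)        # [4. 5. 6.]
```

The first window is [1, 2, 3] with target 4, so each target is always one step after the end of its window and never part of it.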

Function Definition and Working

LSTM(train, n, number_nodes, learning_rate, epochs, batch_size)

Purpose:
The LSTM function builds, compiles, and trains a Long Short-Term Memory (LSTM) neural network model using the provided time series training data. The model aims to predict future values in the series based on the input sequences of historical data.

Input:

  • train: A pandas Series or numpy array containing the time series training data.
  • n: An integer defining the number of lagged data points to use as input for the LSTM model.
  • number_nodes: The number of neurons in the LSTM layer and in each hidden Dense layer of the neural network (the output layer has a single neuron).
  • learning_rate: The learning rate for the optimizer during training.
  • epochs: The number of epochs to train the model.
  • batch_size: The number of samples per gradient update during training.

Processing Elements:

  1. TensorFlow and Keras: Utilized for creating the LSTM model, compiling it, and fitting it to the training data.
  2. apply_transform Function: Called to transform the training data into sequences suitable for LSTM input.
  3. Sequential Model API: Used for stacking layers to build the LSTM model.
  4. Adam Optimizer: An algorithm for first-order gradient-based optimization of stochastic objective functions.

Output:

  • model: The trained Keras Sequential LSTM model.
  • history: A record of training loss and mean absolute error values at successive epochs.
  • full_predictions: The model's predictions for the input data used during training.

Pseudo Code Algorithm

Function LSTM with parameters: train, n, number_nodes, learning_rate, epochs, batch_size
    Transform 'train' data into sequences and targets using apply_transform function
    Initialize a Sequential LSTM model
        Add Input layer with shape (n,1)
        Add LSTM layer with 'number_nodes' neurons
        Add two Dense layers each with 'number_nodes' neurons and 'relu' activation
        Add a Dense output layer with a single neuron
    Compile the model with 'mse' loss function, Adam optimizer with 'learning_rate', and 'mean_absolute_error' metric
    Fit the model to 'middle_data' and 'target_data' for 'epochs' with 'batch_size', without verbosity
    Predict on 'middle_data' to obtain full predictions
    Return the model, training history, and full predictions
EndFunction

Flow of the Program for LSTM()

  1. Call apply_transform with the training data train and the lag value n to prepare the input and target data for the LSTM.
  2. Define the LSTM model architecture using the Sequential API from Keras with an input layer, LSTM layer, two dense layers, and an output layer.
  3. Compile the LSTM model with the mean squared error loss function, Adam optimizer with the specified learning rate, and mean absolute error as a performance metric.
  4. Train the model on the transformed data for the given number of epochs and batch size.
  5. After training, use the model to predict on the input data to get the full set of predictions.
  6. Output the trained model, the history of its performance over the epochs, and the full predictions array.
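
The steps above map fairly directly onto the Keras API. The sketch below is an illustration rather than the repository's code: the function name train_lstm_sketch is hypothetical, it takes data that has already been windowed (instead of calling apply_transform itself), and flattening the predictions to one dimension is an assumption.

```python
import numpy as np
from tensorflow import keras

def train_lstm_sketch(middle_data, target_data, n, number_nodes,
                      learning_rate, epochs, batch_size):
    """Build, compile, and fit the described architecture on windowed data."""
    model = keras.Sequential([
        keras.Input(shape=(n, 1)),                 # n lagged values, 1 feature
        keras.layers.LSTM(number_nodes),
        keras.layers.Dense(number_nodes, activation="relu"),
        keras.layers.Dense(number_nodes, activation="relu"),
        keras.layers.Dense(1),                     # single-value forecast
    ])
    model.compile(
        loss="mse",
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        metrics=["mean_absolute_error"],
    )
    history = model.fit(middle_data, target_data,
                        epochs=epochs, batch_size=batch_size, verbose=0)
    # Assumption: return predictions as a flat 1-D array.
    full_predictions = model.predict(middle_data, verbose=0).flatten()
    return model, history, full_predictions

# Tiny synthetic run: 20 windows of length 5 (values are illustrative only).
rng = np.random.default_rng(0)
x = rng.random((20, 5, 1))
y = rng.random(20)
model, history, preds = train_lstm_sketch(
    x, y, n=5, number_nodes=8, learning_rate=0.001, epochs=2, batch_size=4)
print(preds.shape)             # (20,)
print(list(history.history))   # names of the recorded loss and metric series
```

The hyperparameter values in the usage lines are placeholders for a quick smoke run, not recommended settings.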

Function Definition and Working

calculate_accuracy(true_values, predictions)

Purpose:
The function calculate_accuracy computes common statistical accuracy metrics to evaluate the performance of regression models, specifically Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).

Input:

  • true_values: An array-like structure, typically a numpy array or pandas Series, that contains the actual observed values.
  • predictions: An array-like structure with the predicted values, expected to be of the same length as true_values.

Processing Elements:

  1. Mean Squared Error (MSE): This metric measures the average of the squares of the errors, i.e., the average squared difference between the estimated values and the actual value.
  2. Root Mean Squared Error (RMSE): The square root of the MSE, expressed in the same units as the data. It equals the standard deviation of the residuals only when their mean is zero.
  3. Mean Absolute Error (MAE): This metric measures the average magnitude of the errors in a set of predictions, without considering their direction.

Output:

  • mse: A float representing the Mean Squared Error.
  • rmse: A float representing the Root Mean Squared Error.
  • mae: A float representing the Mean Absolute Error.

Pseudo Code Algorithm

Function calculate_accuracy with parameters: true_values, predictions
    Calculate MSE by taking the mean of the squared differences between true_values and predictions
    Calculate RMSE by taking the square root of MSE
    Calculate MAE by taking the mean of the absolute differences between true_values and predictions
    Return mse, rmse, mae
EndFunction

Flow of the Program for calculate_accuracy()

  1. Utilize the mean_squared_error function from sklearn.metrics to calculate the MSE between the true_values and predictions.
  2. Compute the RMSE by taking the square root of the MSE using numpy's sqrt function.
  3. Calculate the MAE using the mean_absolute_error function from sklearn.metrics.
  4. Return the computed values of MSE, RMSE, and MAE to be used as accuracy metrics for the model evaluation.
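
The flow above relies on sklearn.metrics. To make the formulas themselves visible, the sketch below computes the same three quantities directly with numpy, following the pseudo code. The function name accuracy_metrics_sketch and the sample values are illustrative assumptions.

```python
import numpy as np

def accuracy_metrics_sketch(true_values, predictions):
    """Return (mse, rmse, mae) for two equal-length sequences."""
    true_values = np.asarray(true_values, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    residuals = true_values - predictions
    mse = float(np.mean(residuals ** 2))      # mean of squared differences
    rmse = float(np.sqrt(mse))                # same units as the data
    mae = float(np.mean(np.abs(residuals)))   # mean of absolute differences
    return mse, rmse, mae

# Residuals here are 2, 0, and -3, so MSE = 13/3 and MAE = 5/3.
mse, rmse, mae = accuracy_metrics_sketch([100.0, 102.0, 101.0],
                                         [98.0, 102.0, 104.0])
print(mse, rmse, mae)
```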

Function Definition and Working

Error_Evaluation(train_data, predict_train_data, n)

Purpose:
The Error_Evaluation function is designed to calculate the errors between the actual training data and the predictions made by the LSTM model. This can be used for further analysis of the model's performance and error correction.

Input:

  • train_data: A pandas Series or numpy array containing the actual observed training values.
  • predict_train_data: A pandas Series or numpy array containing the predicted values obtained from the LSTM model. Because the first n observations serve only as model input, it is expected to hold n fewer values than train_data.
  • n: An integer representing the number of lagged observations used in the LSTM model (the size of the input sequence).

Processing Elements:

  1. List Comprehension: Iterates through the predicted data to compute the difference with the actual data, point by point.

Output:

  • errors: A list of error values, each computed as the actual value minus the predicted value.

Pseudo Code Algorithm

Function Error_Evaluation with parameters: train_data, predict_train_data, n
    Initialize an empty list called 'errors'
    Loop through the indices of predict_train_data:
        Calculate the error at each point as the difference between the actual value (train_data at index n+i) and the predicted value (predict_train_data at index i)
        Append the error to the 'errors' list
    Return the 'errors' list
EndFunction

Flow of the Program for Error_Evaluation()

  1. Initialize an empty list to store the error values.
  2. Iterate over the predicted training data.
  3. For each predicted value, calculate the error by subtracting the predicted value from the actual value (considering the lag n).
  4. Store each error value in the list.
  5. Return the complete list of errors after the iteration is finished. This list can be used to analyze the distribution and pattern of errors made by the model during training.
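
A minimal sketch of this alignment, assuming plain Python sequences (the function name error_evaluation_sketch and the sample numbers are illustrative, not taken from the repository):

```python
def error_evaluation_sketch(train_data, predict_train_data, n):
    """Return actual-minus-predicted errors, offsetting the actuals by the lag n."""
    # The i-th prediction targets the observation at position n + i,
    # because the first n observations only serve as model input.
    return [train_data[n + i] - predict_train_data[i]
            for i in range(len(predict_train_data))]

actual = [10.0, 11.0, 12.0, 13.0, 14.0]
predicted = [12.5, 14.5]   # model outputs for positions 3 and 4 when n = 3
print(error_evaluation_sketch(actual, predicted, n=3))   # [0.5, -0.5]
```

With a pandas Series whose index is not the default integer range, positional access through .iloc would be the safer choice than plain square brackets.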

Function Definition and Working

Parameter_calculation(data)

Purpose:
The Parameter_calculation function aims to determine the optimal parameters for an ARIMA (Autoregressive Integrated Moving Average) model using the given time series data. It also generates plots for the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF), which are helpful for identifying the ARIMA model's parameters.

Input:

  • data: A pandas Series or numpy array containing the time series data.

Processing Elements:

  1. auto_arima from pmdarima: This is a function that automates the process of ARIMA modeling, including the selection of optimal parameters.
  2. plot_acf from statsmodels: Generates an ACF plot, which is used to identify the number of MA (Moving Average) terms.
  3. plot_pacf from statsmodels: Generates a PACF plot, which is used to identify the number of AR (Autoregressive) terms.
  4. Global Variables:
    • lag: Used to set the number of lags in the ACF and PACF plots.
    • Output_address: Used to specify the directory path where the ACF and PACF plot images will be saved.

Output:

  • ord: A tuple representing the order of the ARIMA model, which consists of (p, d, q) parameters where 'p' is the number of AR terms, 'd' is the degree of differencing, and 'q' is the number of MA terms.

Pseudo Code Algorithm

Function Parameter_calculation with parameter: data
    Run auto_arima on 'data' with tracing enabled to find optimal parameters
    Plot the ACF of 'data' using the global 'lag' variable
    Save the ACF plot to the 'Output_address' directory with the filename "ACF.jpg"
    Plot the PACF of 'data' using the global 'lag' variable
    Save the PACF plot to the 'Output_address' directory with the filename "PACF.jpg"
    Extract the order (p, d, q) of the ARIMA model from the findings of auto_arima
    Return the order of the ARIMA model
EndFunction

Flow of the Program for Parameter_calculation()

  1. Execute the auto_arima function on the input data to automatically determine the best-fitting ARIMA model parameters while printing the trace of the fitting process.
  2. Plot the ACF for the given data up to the number of lags specified by lag.
  3. Save the ACF plot to the specified Output_address directory with the appropriate filename.
  4. Plot the PACF for the given data up to the number of lags specified by lag.
  5. Save the PACF plot to the specified Output_address directory with the appropriate filename.
  6. Retrieve the order of the ARIMA model (p, d, q) from the results of the auto_arima function.
  7. Return the ARIMA model order for use in subsequent model fitting.

Function Definition and Working

ARIMA_Model(train, len_test, ord)

Purpose:
The ARIMA_Model function fits an ARIMA model to the training data and uses it to make predictions. The primary use in this context is to forecast the potential errors from an LSTM model, which can then be used for error correction in the LSTM's predictions.

Input:

  • train: A pandas Series or numpy array containing the training set data used to fit the ARIMA model.
  • len_test: An integer representing the length of the test dataset, which dictates how many future steps to predict.
  • ord: A tuple indicating the order of the ARIMA model, typically obtained from the Parameter_calculation function, which consists of (p, d, q) parameters.

Processing Elements:

  1. ARIMA from statsmodels: A class that represents an ARIMA model, used here for time series forecasting.
  2. Fitting the Model: The ARIMA model is fitted to the training data using the provided order parameters.
  3. Predictions: The model is used to make predictions for the specified future time steps.

Output:

  • model: The fitted ARIMA model object.
  • predictions: The out-of-sample forecasts, starting at the end of the training set and covering the length of the test set.
  • full_predictions: The full set of in-sample predictions for the training data.

Pseudo Code Algorithm

Function ARIMA_Model with parameters: train, len_test, ord
    Initialize an ARIMA model with 'train' data and 'ord' order
    Fit the ARIMA model to the 'train' data
    Make predictions from the end of 'train' data up to the length of the test set plus one
    Make full in-sample predictions for the 'train' data
    Return the fitted model, out-of-sample predictions, and in-sample predictions
EndFunction

Flow of the Program for ARIMA_Model()

  1. Instantiate an ARIMA model with the training data train and the order parameters ord.
  2. Fit the model to the training data using the fit() method.
  3. Use the predict method of the fitted model to forecast future values for a range starting at the end of the training set and extending len_test steps into the future.
  4. Also, generate a full set of in-sample predictions for the training data, which covers the entire range of the training set.
  5. Return the fitted ARIMA model, the out-of-sample predictions for error correction, and the in-sample predictions for evaluation purposes.

Function Definition and Working

Final_Predictions(predictions_errors, predictions)

Purpose:
The Final_Predictions function calculates the final forecasted values by adjusting the LSTM model predictions with the ARIMA model-predicted errors. This technique is often used in hybrid models to correct predictions from one model using insights from another.

Input:

  • predictions_errors: A list or pandas Series containing the ARIMA model's forecasts of the LSTM prediction errors (actual value minus LSTM prediction).
  • predictions: A list or pandas Series containing the LSTM model's predictions.

Processing Elements:

  1. List Iteration: A loop that runs through the number of days (a globally set variable), combining the predictions from the LSTM model and the errors predicted by the ARIMA model.

Output:

  • final_values: A list of the corrected predictions after accounting for the ARIMA-predicted errors.

Pseudo Code Algorithm

Function Final_Predictions with parameters: predictions_errors, predictions
    Initialize an empty list 'final_values'
    Loop over the range of 'days' (global variable):
        Calculate the final value by adding the prediction error to the LSTM prediction at each index
        Append the final value to 'final_values'
    Return 'final_values'
EndFunction

Flow of the Program for Final_Predictions()

  1. Start by creating an empty list final_values to store the adjusted predictions.
  2. Loop through a range of indices defined by the global variable days, which determines how many final predictions to calculate.
  3. At each iteration, add the corresponding prediction error from predictions_errors to the LSTM prediction from predictions and append the result to final_values.
  4. After the loop completes, return final_values, which contains the final adjusted predictions.
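
As a small illustration of the correction step, the sketch below adds each forecast error to the matching LSTM prediction. Passing days as an argument (the source reads it from a global), the function name, and the sample numbers are assumptions.

```python
def final_predictions_sketch(predictions_errors, predictions, days):
    """Return LSTM predictions corrected by the forecast errors."""
    # Errors are defined as actual - predicted, so adding the forecast error
    # shifts each LSTM prediction toward the expected actual value.
    return [predictions[i] + predictions_errors[i] for i in range(days)]

lstm_predictions = [100.0, 101.5, 103.0]
arima_errors = [0.5, -1.0, 0.25]
print(final_predictions_sketch(arima_errors, lstm_predictions, days=3))
# [100.5, 100.5, 103.25]
```

Both inputs need at least days elements; otherwise the indexing raises an IndexError.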

Function Definition and Working

main()

Purpose:
The main function orchestrates the entire process of loading data, preparing it, training the LSTM model, making predictions, evaluating errors, and generating various plots and outputs. It serves as the entry point for running the time series forecasting program.

Input:
The main function takes no arguments. It relies on global variables and on the functions it calls to operate on the data.

Processing Elements:

  1. Data loading and plotting functions: data_loader, plot_raw_data
  2. Data partitioning function: data_allocation
  3. Model training and prediction functions: LSTM, Error_Evaluation, Parameter_calculation, ARIMA_Model, Final_Predictions
  4. Accuracy calculation function: calculate_accuracy
  5. Plotting functions for data splits, predictions, errors, and accuracy: plot_train_test, plot_predictions, plot_prediction_errors, plot_final_predictions, plot_accuracy, plot_arima_accuracy
  6. File writing: Outputs model summaries and predictions to a text file.

Output:
The main function does not return any value. Its outputs are:

  • Plots saved as images in the specified output directory.
  • Console prints of model summaries and accuracy metrics.
  • A text file saved with detailed model information and predictions.

Pseudo Code Algorithm

Function main
    Load data using data_loader function
    Plot raw data using plot_raw_data function
    Partition data into training and testing sets using data_allocation function
    Plot training and testing data using plot_train_test function
    
    Start timing the LSTM model process
    Train LSTM model using LSTM function
    Plot LSTM predictions using plot_predictions function
    Make new predictions using the trained LSTM model
    
    Evaluate errors in LSTM predictions using Error_Evaluation function
    Plot prediction errors using plot_prediction_errors function
    Calculate accuracy of LSTM predictions using calculate_accuracy function
    Plot LSTM accuracy using plot_accuracy function
    
    Determine ARIMA model parameters using Parameter_calculation function
    Fit ARIMA model and make predictions on errors using ARIMA_Model function
    Calculate ARIMA model accuracy and plot it using plot_arima_accuracy function
    
    Calculate final predictions by combining LSTM predictions and ARIMA predicted errors using Final_Predictions function
    Plot final predictions using plot_final_predictions function
    
    Write LSTM and ARIMA model details, predictions, and accuracies to an output text file
    Print the time taken for the entire process
EndFunction

Call main function if the script is the main program

Flow of the Program for main()

  1. Call data_loader to load the dataset.
  2. Call plot_raw_data to visualize the raw dataset.
  3. Call data_allocation to split the data into training and testing sets.
  4. Call plot_train_test to visualize the training and testing datasets.
  5. Train the LSTM model by calling LSTM and plot its predictions.
  6. Generate predictions for the test set using the trained LSTM model.
  7. Evaluate prediction errors with Error_Evaluation and visualize them.
  8. Calculate and print the LSTM model's accuracy, plotting the results.
  9. Determine the best ARIMA model parameters and fit the ARIMA model to predict errors.
  10. Plot ARIMA model accuracy.
  11. Combine LSTM predictions with ARIMA-predicted errors using Final_Predictions and visualize the final predictions.
  12. Write all relevant outputs, including model summaries, accuracies, and predictions, to a text file.
  13. Print the total time taken for the process.
  14. Execute the main function if the script is run as the main program.