Synthetic Stock Data Generation using Conditional GANs

Project Overview

This project uses a Conditional Generative Adversarial Network (CGAN) to generate synthetic stock data, specifically closing prices. The conditional setup allows the model to generate stock data conditioned on a specific attribute, in this case the volatility of the stock.

Data Collection and Preprocessing

  1. Fetching Stock Data: Data for a given ticker (e.g., AAPL) is fetched for a period of two years using the yfinance library.

  2. Preprocessing:

    • The raw stock closing prices are scaled using MinMax scaling.
    • The volatility of the stock data, computed as the rolling standard deviation, is also scaled using MinMax scaling.

This preprocessing step ensures the data is in a suitable range for neural network training and prepares volatility as the condition for the CGAN.
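
A minimal sketch of this pipeline, assuming daily data and a 5-day rolling window (the ticker and variable names are illustrative; the notebook's exact choices may differ):

```python
# Sketch of the data pipeline; the ticker and rolling window are assumptions.
import yfinance as yf
from sklearn.preprocessing import MinMaxScaler

# Fetch two years of daily data for the ticker
df = yf.download("AAPL", period="2y")

# Scale closing prices to [0, 1]
price_scaler = MinMaxScaler()
prices = price_scaler.fit_transform(df[["Close"]])

# Rolling volatility (standard deviation of the closing price), also scaled
vol = df["Close"].rolling(window=5).std().bfill()
vol_scaler = MinMaxScaler()
volatility = vol_scaler.fit_transform(vol.values.reshape(-1, 1))
```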

Conditional GAN Architecture

Generator

The generator produces the synthetic stock data. It takes random noise along with the scaled volatility as a condition and outputs synthetic closing prices; a code sketch follows the list below.

  1. Input: Concatenation of noise and the condition (volatility).
  2. Architecture:
    • Fully connected layers with Batch Normalization and Dropout.
    • LeakyReLU activation functions are used, with a linear activation at the output layer.
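
As a concrete illustration, here is a minimal Keras sketch of such a generator; the framework choice, layer sizes, and noise dimension are assumptions rather than the notebook's exact configuration.

```python
# Generator sketch; layer sizes and noise dimension are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(noise_dim=100, cond_dim=1):
    noise = layers.Input(shape=(noise_dim,))
    condition = layers.Input(shape=(cond_dim,))        # scaled volatility
    x = layers.Concatenate()([noise, condition])
    for units in (128, 256):
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
        x = layers.Dropout(0.3)(x)
    # Linear activation at the output layer, as described above
    out = layers.Dense(1, activation="linear")(x)
    return tf.keras.Model([noise, condition], out, name="generator")
```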

Discriminator

The discriminator tries to distinguish between real stock data and the synthetic data generated by the generator.

  1. Input: Real or synthetic stock data concatenated with the condition (volatility).
  2. Architecture:
    • Fully connected layers with Dropout.
    • LeakyReLU activation functions are used.
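
A matching sketch of the discriminator, under the same assumptions:

```python
# Discriminator sketch; layer sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(cond_dim=1):
    price = layers.Input(shape=(1,))                   # real or synthetic price
    condition = layers.Input(shape=(cond_dim,))        # scaled volatility
    x = layers.Concatenate()([price, condition])
    for units in (256, 128):
        x = layers.Dense(units)(x)
        x = layers.LeakyReLU(0.2)(x)
        x = layers.Dropout(0.3)(x)
    out = layers.Dense(1, activation="sigmoid")(x)     # probability of "real"
    return tf.keras.Model([price, condition], out, name="discriminator")
```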

Loss Functions and Optimizers

  • The discriminator uses binary cross-entropy loss, comparing its predictions on real data to an array of ones and its predictions on fake (synthetic) data to an array of zeros.

  • The generator’s loss function combines several components (a code sketch follows this list):

    1. Main Loss: Binary cross-entropy on the discriminator’s predictions for the generated data, compared against an array of ones.
    2. Temporal Difference: Compares the point-to-point changes of the generated series with those of the real series.
    3. Mean Squared Error (MSE): Penalizes the mean squared difference between consecutive points in the generated series.
    4. Variance Loss: Ensures the variance of the generated data matches the variance of the real data.
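
The following is a hedged sketch of both losses; the component weights (`lam_*`) and the exact formulations are illustrative assumptions, not the notebook's exact values:

```python
# Loss sketch; the lam_* weights and formulations are assumptions.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def discriminator_loss(d_real, d_fake):
    # Predictions on real data compared to ones, on fake data to zeros
    return (bce(tf.ones_like(d_real), d_real) +
            bce(tf.zeros_like(d_fake), d_fake))

def generator_loss(d_fake, fake, real, lam_td=1.0, lam_mse=1.0, lam_var=1.0):
    # 1. Main adversarial loss: the generator wants d_fake labelled as real
    adv = bce(tf.ones_like(d_fake), d_fake)
    # 2. Temporal difference: first differences of fake vs. real series
    td = tf.reduce_mean(tf.abs((fake[1:] - fake[:-1]) - (real[1:] - real[:-1])))
    # 3. MSE between consecutive generated points
    mse = tf.reduce_mean(tf.square(fake[1:] - fake[:-1]))
    # 4. Variance matching between generated and real data
    var = tf.abs(tf.math.reduce_variance(fake) - tf.math.reduce_variance(real))
    return adv + lam_td * td + lam_mse * mse + lam_var * var
```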

Training

The CGAN is trained iteratively with the generator and discriminator competing against each other. The generator aims to produce stock data so realistic that the discriminator can't tell it's fake, while the discriminator tries to get better at distinguishing real data from fake.

During training:

  • The generator creates synthetic stock data.
  • The discriminator tries to distinguish this synthetic data from the real stock data.
  • Both models are updated based on their performance.

Post-training, the generator and discriminator losses over the epochs are visualized to assess the training process.
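
Putting the pieces together, one training step might look like the following, continuing the earlier sketches (optimizer settings and learning rates are assumptions):

```python
# Training-step sketch, building on the generator/discriminator/loss
# sketches above; learning rates are assumptions.
import tensorflow as tf

generator = build_generator()
discriminator = build_discriminator()
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_prices, conditions, noise_dim=100):
    noise = tf.random.normal([tf.shape(real_prices)[0], noise_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_prices = generator([noise, conditions], training=True)
        d_real = discriminator([real_prices, conditions], training=True)
        d_fake = discriminator([fake_prices, conditions], training=True)
        g_loss = generator_loss(d_fake, fake_prices, real_prices)
        d_loss = discriminator_loss(d_real, d_fake)
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    return g_loss, d_loss
```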

Post-processing and Synthetic Data Generation

  1. Synthetic Data Generation: After training, the generator is used to produce synthetic stock data conditioned on the scaled volatilities of the original stock data.

  2. Post-processing: The synthetic stock data generated by the model is then rescaled back to the original data's scale using the inverse of the MinMax scaling applied during preprocessing.

  3. Smoothing Series: The synthetic dataset is smoothed using a rolling average with a window of 5.

  4. Rolling Volatility: The rolling volatility of the original dataset with a window of 5 is computed.

  5. Residual Trends: The deviations of the original dataset from its smoothed version are captured as residual trends.

  6. Dynamic Volatility Scaling: The synthetic data is scaled dynamically based on its correlation with the rolling volatility of the original data.

  7. Top Volatile Days: Days in the original dataset that experience the highest 15% of volatilities are flagged. If such volatile days occur consecutively, only the first day is considered for the analysis.

  8. Data Merging: The synthetic data, scaled volatility, and the residual trends are combined to get a volatile version of the synthetic dataset.
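
The sketch below illustrates steps 1-5 and 7 under the stated assumptions. Step 6's correlation-based scaling and the exact merging rule of step 8 are not fully specified above, so the final line is only an illustrative combination:

```python
# Post-processing sketch; window sizes and the 15% threshold follow the
# description above, but the merging rule on the last line is a guess.
import numpy as np
import pandas as pd

# 1-2. Generate synthetic prices and rescale to the original price range
noise = np.random.normal(size=(len(volatility), 100))
synthetic = price_scaler.inverse_transform(
    generator.predict([noise, volatility])).ravel()
synthetic = pd.Series(synthetic, index=df.index)

# 3. Smooth the synthetic series with a 5-day rolling average
smooth_synth = synthetic.rolling(window=5).mean()

# 4. Rolling volatility of the original series (window of 5)
roll_vol = df["Close"].rolling(window=5).std()

# 5. Residual trends: original minus its smoothed version
residuals = df["Close"] - df["Close"].rolling(window=5).mean()

# 7. Flag the top 15% most volatile days, keeping only the first day
#    of each consecutive run
volatile = roll_vol > roll_vol.quantile(0.85)
volatile_days = volatile & ~volatile.shift(fill_value=False)

# 8. Illustrative merge of the synthetic data with the residual trends
volatile_synth = smooth_synth + residuals.where(volatile_days, 0.0)
```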

Analysis and Visualization

  1. Visualization: The original and synthetic datasets are visualized over time to provide a visual comparison between the two.

  2. Statistical Properties: The mean, standard deviation, and quartiles (25th, 50th, and 75th) of the two datasets are computed and displayed.

  3. Frequency Distribution: The frequency distribution of both datasets is depicted using a Kernel Density Estimate plot.

  4. Volatility Analysis: Using a 20-day rolling window, the volatilities of both datasets are visualized and their descriptive statistics are provided.

  5. Risk Metric Analysis:

    • Value at Risk (VaR) and Conditional Value at Risk (CVaR) are computed for both datasets and visualized using bar plots.
    • The exact VaR and CVaR values are provided for clarity.

  6. Distributional Analysis:

    • A Kolmogorov-Smirnov test is performed to compare the distribution of returns from both datasets.
    • Skewness and kurtosis of both datasets are computed and visualized using bar plots.

  7. Spectral Analysis:

    • The spectral density of the returns from both datasets is visualized.
    • Statistics like maximum, minimum, mean, and standard deviation of the spectral density for the original returns are provided.
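
A sketch of the core metrics (the 5% tail, i.e. a 95% confidence level for VaR, is an assumption):

```python
# Risk and distributional metrics sketch; the 5% tail is an assumption.
import numpy as np
from scipy import stats
from scipy.signal import periodogram

orig_ret = df["Close"].pct_change().dropna()
synth_ret = synthetic.pct_change().dropna()

def var_cvar(returns, alpha=0.05):
    var = np.quantile(returns, alpha)          # Value at Risk
    cvar = returns[returns <= var].mean()      # Conditional VaR (expected shortfall)
    return var, cvar

print("Original  VaR/CVaR:", var_cvar(orig_ret))
print("Synthetic VaR/CVaR:", var_cvar(synth_ret))

# Kolmogorov-Smirnov test on the two return distributions
ks_stat, p_value = stats.ks_2samp(orig_ret, synth_ret)

# Skewness and kurtosis of returns
print("Skewness:", stats.skew(orig_ret), stats.skew(synth_ret))
print("Kurtosis:", stats.kurtosis(orig_ret), stats.kurtosis(synth_ret))

# Spectral density of the original returns
freqs, psd = periodogram(orig_ret)
print("PSD max/min/mean/std:", psd.max(), psd.min(), psd.mean(), psd.std())
```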

Conclusion

  • Visualization shows that the synthetic data follows a pattern similar to the original data's but diverges from it in places.
  • Statistical properties indicate that the synthetic data has a slightly lower mean and standard deviation than the original data.
  • The frequency distribution plots hint at how the two datasets might differ in terms of distribution.
  • Volatility analysis shows that the synthetic data is slightly less volatile on average but has occasional spikes of higher volatility.
  • Risk metric analysis reveals that the synthetic data carries higher potential risk, as shown by its higher VaR and CVaR values.
  • Distributional Analysis suggests significant differences in the return distributions of the two datasets. The synthetic data has higher skewness and kurtosis, meaning it has heavier tails and a sharper peak.
  • Spectral analysis offers insights into the frequency domain characteristics of the returns. The spectral density statistics highlight the oscillatory characteristics present in the original returns.

By inspecting the various analyses, it's evident that while the synthetic data mimics some characteristics of the original, there are clear distinctions in their statistical properties, risk profiles, and distributional attributes. This information is crucial for any stakeholder intending to use the synthetic dataset as a replacement for, or supplement to, the original stock dataset.