sep-dataset

This repository releases the dataset for "Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models" [Paper].

Note: For text data, only the raw dataset is used in our work. The preprocessed dataset was used to conduct ablation studies with existing models.

Dataset Overview

Price and tweet data from 2020 to 2022 for 55 stocks: the top 5 stocks in each of 11 industries.

The full list of stocks and their companies can be found in stocktable.pdf.

Data Components

This dataset comprises two main components:

./tweet: Tweet data from Twitter
./price: Price data from Yahoo Finance

Data Format

We collect data in the same format as the Stocknet Dataset.

As the number of tweets has increased exponentially, we also employ a clustering pipeline to obtain the most representative tweets for each day.

Raw Tweet Data

Format: JSON
Keys: see Introduction to Tweet JSON

Preprocessed Tweet Data

Format: JSON
Keys: 'text', 'created_at', 'user_id_str'
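
As a rough illustration of reading the preprocessed tweets, the sketch below parses records and keeps only those carrying the three keys listed above. It assumes one JSON object per line, which is not confirmed here; the function name `parse_tweets` and the sample record are hypothetical.

```python
import json

# Keys listed for the preprocessed tweet data.
REQUIRED_KEYS = ("text", "created_at", "user_id_str")


def parse_tweets(lines):
    """Parse an iterable of strings, assuming one JSON object per line.

    Blank lines are skipped; records missing any required key are dropped.
    """
    tweets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if all(key in record for key in REQUIRED_KEYS):
            tweets.append(record)
    return tweets


# Made-up sample record; the field values are placeholders, not real data.
sample = [
    '{"text": "$AAPL up today", "created_at": "2020-01-02", "user_id_str": "123"}',
    "",
]
tweets = parse_tweets(sample)
```

To read from disk, an open file handle can be passed in place of `sample`, since iterating a text file yields its lines.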

Raw Price Data

Format: CSV
Entries: date, open price, high price, low price, close price, adjusted close price, volume
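
A minimal sketch of loading the raw price entries with the standard library follows. The header names (`Date`, `Open`, `High`, `Low`, `Close`, `Adj Close`, `Volume`) are an assumption based on the usual Yahoo Finance export and may differ in the actual files; `read_prices` and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical sample; the header assumes the usual Yahoo Finance export.
SAMPLE_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2020-01-02,10.0,11.0,9.5,10.5,10.4,1000
2020-01-03,10.5,12.0,10.0,11.55,11.44,1500
"""


def read_prices(handle):
    """Read price rows from a file-like object into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "date": row["Date"],
            "open": float(row["Open"]),
            "high": float(row["High"]),
            "low": float(row["Low"]),
            "close": float(row["Close"]),
            "adj_close": float(row["Adj Close"]),
            # Volume may be written as "1000" or "1000.0"; accept both.
            "volume": int(float(row["Volume"])),
        })
    return rows


prices = read_prices(io.StringIO(SAMPLE_CSV))
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` wrapper.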

Preprocessed Price Data

Format: TXT
Entries: date, close price, open price, high price, low price, close price change, volume
Note: open, high, low, and close prices are normalized by the previous close price: $p_t = \tilde{p}_t / \tilde{p}^c_{t-1} - 1$.
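
The normalization above can be sketched as follows. This is an illustration of the formula only, not the preprocessing script used to build the dataset; `normalize_prices` and the sample rows are hypothetical, and the first day is dropped on the assumption that it has no previous close to divide by.

```python
PRICE_FIELDS = ("open", "high", "low", "close")


def normalize_prices(rows):
    """Apply p_t = raw_p_t / raw_close_{t-1} - 1 to each price field.

    `rows` must be in chronological order. The first row has no previous
    close, so the output has one row fewer than the input.
    """
    normalized = []
    for prev, cur in zip(rows, rows[1:]):
        prev_close = prev["close"]
        entry = {"date": cur["date"]}
        for field in PRICE_FIELDS:
            entry[field] = cur[field] / prev_close - 1
        normalized.append(entry)
    return normalized


# Made-up raw prices for two consecutive trading days.
raw = [
    {"date": "2020-01-02", "open": 10.0, "high": 11.0, "low": 9.5, "close": 10.0},
    {"date": "2020-01-03", "open": 10.5, "high": 12.0, "low": 10.0, "close": 11.0},
]
normalized = normalize_prices(raw)
```

With the previous close at 10.0, the second day's open of 10.5 maps to roughly 0.05 and its close of 11.0 to roughly 0.10, up to floating-point rounding.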

Citation

If you use this dataset, please cite our paper.

@inproceedings{koa2024learning,
  title={Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models},
  author={Koa, Kelvin J.L. and Ma, Yunshan and Ng, Ritchie and Chua, Tat-Seng},
  booktitle={Proceedings of the ACM on Web Conference 2024},
  pages={4304--4315},
  year={2024}
}
