This repository releases the dataset for "Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models" [Paper].
Note: For text data, only the raw dataset is used in our work. The preprocessed dataset was used to conduct ablation studies with existing models.
The dataset contains price and tweet data from 2020 to 2022 for 55 stocks: the top 5 stocks in each of 11 industries.
The full list of stocks and their companies can be found in stocktable.pdf.
This dataset comprises two main components:
- ./tweet: Tweet data from Twitter
- ./price: Price data from Yahoo Finance
We collect data in the same format as the Stocknet Dataset.
As the number of tweets has increased exponentially, we also employ a clustering pipeline to obtain the most representative tweets for each day.
Raw tweet data:
Format: JSON
Keys: see Introduction to Tweet JSON
Preprocessed tweet data:
Format: JSON
Keys: 'text', 'created_at', 'user_id_str'
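As a minimal sketch of reading records with the three keys listed above, the Python snippet below parses tweets under the assumption that each one is stored as a single JSON object per line. That layout, the helper name `parse_preprocessed_tweets`, and the sample record are all hypothetical and not taken from this repository; adjust them to match the actual files.

```python
import json


def parse_preprocessed_tweets(lines):
    """Parse an iterable of JSON strings, one tweet object per line.

    The one-object-per-line layout is an assumption, not something
    this README states.
    """
    tweets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the three keys documented for this format.
        tweets.append({
            "text": record["text"],
            "created_at": record["created_at"],
            "user_id_str": record["user_id_str"],
        })
    return tweets


# Hypothetical sample record, for illustration only.
sample = [
    '{"text": "example tweet about $AAPL", '
    '"created_at": "Mon Jan 06 14:30:00 +0000 2020", '
    '"user_id_str": "12345"}',
    "",
]
tweets = parse_preprocessed_tweets(sample)
print(tweets[0]["text"])
```

To read from disk, pass an open file object instead of the sample list, since iterating over a text file yields one line at a time.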
Raw price data:
Format: CSV
Entries: date, open price, high price, low price, close price, adjusted close price, volume
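The column order above can be read with Python's standard `csv` module. The sketch below assumes the file starts with a header row, as Yahoo Finance exports typically do; the column names in `COLUMNS`, the helper `read_raw_prices`, and the sample values are illustrative choices rather than part of the dataset.

```python
import csv
import io

# Column order as listed in this README; the key names are our own.
COLUMNS = ["date", "open", "high", "low", "close", "adj_close", "volume"]


def read_raw_prices(fileobj, has_header=True):
    """Read CSV price rows into a list of dicts with numeric fields."""
    reader = csv.reader(fileobj)
    if has_header:
        next(reader, None)  # skip the header row (assumed to exist)
    rows = []
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed rows
        row = dict(zip(COLUMNS, fields))
        for key in COLUMNS[1:]:
            row[key] = float(row[key])
        rows.append(row)
    return rows


# Made-up values, for illustration only.
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2020-01-02,10.0,11.0,9.5,10.5,10.4,1000\n"
)
prices = read_raw_prices(sample_csv)
print(prices[0]["date"], prices[0]["close"])
```

For a file on disk, open it with `open(path, newline="")` and pass the handle to `read_raw_prices`.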
Preprocessed price data:
Format: TXT
Entries: date, close price, open price, high price, low price, close price change, volume
Note: the open, high, low, and close prices are normalized with the last close price.
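The following Python sketch parses one line in the entry order listed above and illustrates one possible reading of the normalization note. It assumes whitespace-separated fields and that "normalized with the last close price" means dividing by the previous trading day's close; neither is confirmed by this README, and the names `FIELDS`, `parse_preprocessed_line`, and `normalize_by_last_close` are hypothetical.

```python
# Entry order as listed in this README; the key names are our own.
FIELDS = ["date", "close", "open", "high", "low", "close_change", "volume"]


def parse_preprocessed_line(line):
    """Parse one whitespace-separated line (the delimiter is an assumption)."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    row = {"date": parts[0]}
    for name, value in zip(FIELDS[1:], parts[1:]):
        row[name] = float(value)
    return row


def normalize_by_last_close(value, last_close):
    """Divide a price by the previous close.

    This is only one plausible interpretation of the normalization
    note, not the confirmed preprocessing step.
    """
    return value / last_close


# Made-up values, for illustration only.
row = parse_preprocessed_line("2020-01-03\t1.05\t1.01\t1.06\t0.99\t0.05\t1000")
ratio = normalize_by_last_close(10.5, 10.0)
print(row["date"], row["close"], ratio)
```

If the actual files use a different delimiter or normalization (for example, a relative change rather than a ratio), only these two helpers need to change.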
If you use this dataset, please cite our paper.
@inproceedings{koa2024learning,
title={Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models},
author={Koa, Kelvin J.L. and Ma, Yunshan and Ng, Ritchie and Chua, Tat-Seng},
booktitle={Proceedings of the ACM on Web Conference 2024},
pages={4304--4315},
year={2024}
}