Chinese README:

Chinese blog post about this project: 量化系列2 - 众包数据集 (Quant Series 2: Crowdsourced Dataset)

How to use it
Development Setup
Initiative
Project Detail
- Data Source
- Initial loading and Validation logic for each table
Contribution Guide
- Add more stock index
- Add more data source or fields


How to use it

Download the tarball from the latest release page on GitHub.
Extract the tar file to the default qlib data directory.

wget https://github.com/chenditc/investment_data/releases/download/2023-10-08/qlib_bin.tar.gz
mkdir -p ~/.qlib/qlib_data/cn_data
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=1
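On systems without GNU tar, the extraction step can be approximated with Python's standard library. The sketch below is only an illustration: the function name `extract_stripped` is hypothetical, and it assumes the archive is a gzip-compressed tar whose files all sit under a single top-level directory, which is what `--strip-components=1` implies.

```python
import shutil
import tarfile
from pathlib import Path, PurePosixPath
from typing import List


def extract_stripped(archive: str, dest: str, strip: int = 1) -> List[str]:
    """Extract `archive` into `dest`, dropping the first `strip` path components."""
    root = Path(dest).expanduser()
    root.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    root = root.resolve()
    written: List[str] = []
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar:
            parts = PurePosixPath(member.name).parts[strip:]
            # Skip the stripped top-level entry and anything that is not a
            # regular file or directory (links, devices).
            if not parts or not (member.isfile() or member.isdir()):
                continue
            target = root.joinpath(*parts)
            # Refuse entries that would land outside the destination.
            if root not in target.resolve().parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
            if member.isdir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            written.append("/".join(parts))
    return sorted(written)
```

Calling `extract_stripped("qlib_bin.tar.gz", "~/.qlib/qlib_data/cn_data")` would then mirror the `tar` command above and return the relative paths of the files it wrote.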

Development Setup

If you want to contribute to the set of scripts or the data, here is what you should do to set up a dev environment.

Install dolt

Follow the installation instructions at https://github.com/dolthub/dolt

Clone data

The raw data is hosted on DoltHub: https://www.dolthub.com/repositories/chenditc/investment_data

To download it as a Dolt database:

dolt clone chenditc/investment_data

Export to qlib format

docker run \
  -v /<some output directory>:/output \
  -it --rm chenditc/investment_data \
  bash -c "bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/"

To reuse an existing local clone of the chenditc/investment_data Dolt database, add the following parameter to mount it into the container:

  -v /<dolt directory>:/dolt

Run Daily Update

You will need a Tushare token to use the Tushare API. Get one from https://tushare.pro/

export TUSHARE=<Token>
bash daily_update.sh

Daily update and output

docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash -c "bash daily_update.sh && bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/"

Extract tar file to qlib directory

mkdir -p ~/.qlib/qlib_data/cn_data
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=1

Initiative

Fill in missing data by combining multiple data sources, for example data for delisted companies.
Correct data by cross-validating it against multiple data sources.
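Both goals can be sketched as a single pass over two sources. This is a minimal illustration rather than the project's actual merge logic: the function name `merge_sources`, the `(symbol, date)` key, the 0.1% tolerance, and the sample symbols and prices are all invented for the example.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (symbol, trade date), a hypothetical row key


def merge_sources(primary: Dict[Key, float],
                  secondary: Dict[Key, float],
                  rel_tol: float = 1e-3):
    """Fill gaps in `primary` from `secondary` and list rows that disagree."""
    merged = dict(primary)
    filled: List[Key] = []     # rows only the secondary source had
    conflicts: List[Key] = []  # rows where the two sources disagree
    for key, price in sorted(secondary.items()):
        if key not in merged:
            merged[key] = price
            filled.append(key)
        elif abs(merged[key] - price) > rel_tol * abs(merged[key]):
            conflicts.append(key)
    return merged, filled, conflicts


# Made-up closing prices from two sources.
source_one = {("AAA", "2019-01-02"): 9.19, ("AAA", "2019-01-03"): 9.28}
source_two = {("AAA", "2019-01-03"): 9.82, ("BBB", "2019-01-02"): 2.70}
merged, filled, conflicts = merge_sources(source_one, source_two)
```

Here `filled` holds the row only the second source had, and `conflicts` holds the row where the two sources disagree and a third source or manual check would be needed to decide which is right.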

Project Detail

Data Source

Each database table on DoltHub is named with a prefix that identifies its data source, for example ts_a_stock_eod_price. The prefixes mean:

w (Wind): high-quality static data source. Only available until 2019.
c (Caihui): high-quality static data source. Only available until 2019.
ts: Tushare data source
ak: Akshare data source
yahoo: data collected with Qlib's Yahoo collector, https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo
baostock: Baostock data source
final: merged final data, after validation and correction
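The naming convention above can be captured in a small lookup. This is a sketch only: `SOURCE_BY_PREFIX` and `split_table_name` are hypothetical names, and it assumes the prefix is always the text before the first underscore.

```python
# Prefix-to-source mapping, taken from the list above.
SOURCE_BY_PREFIX = {
    "w": "Wind",
    "c": "Caihui",
    "ts": "Tushare",
    "ak": "Akshare",
    "yahoo": "Qlib Yahoo collector",
    "baostock": "Baostock",
    "final": "Merged final data",
}


def split_table_name(table: str):
    """Return (data source, remaining table name) for a prefixed table name."""
    prefix, _, rest = table.partition("_")
    if prefix not in SOURCE_BY_PREFIX or not rest:
        raise ValueError(f"unrecognised table name: {table!r}")
    return SOURCE_BY_PREFIX[prefix], rest
```

For example, `split_table_name("ts_a_stock_eod_price")` gives `("Tushare", "a_stock_eod_price")`.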

Initial loading and Validation logic for each table

Dolt Table Explanation

ts_link_table

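The first available date for a stock can differ between data sources. When we calculate the adjusted price, we treat the price on that first date as the baseline, with adjust factor = 1.0. As a result, two sources that start on different dates can report different adjusted prices for the same stock on the same day.

To merge different data sources, we need to rescale the adjust factor so that every data source yields the same adjusted price.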
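A small worked example shows why rescaling is needed. It assumes the adjusted price is the raw close multiplied by a cumulative factor normalised to 1.0 on the first available date. The function name `adjusted_prices` and all numbers are invented for illustration.

```python
def adjusted_prices(closes, cumulative_factors):
    """Adjusted prices, normalised so the first row has adjust factor 1.0."""
    base = cumulative_factors[0]
    return [close * factor / base
            for close, factor in zip(closes, cumulative_factors)]


# Hypothetical stock with a 2-for-1 split before day 2.
closes = [10.0, 5.0, 6.0]
factors = [1.0, 2.0, 2.0]

# Source A has all three days; source B only starts on day 2.
source_a = adjusted_prices(closes, factors)          # [10.0, 10.0, 12.0]
source_b = adjusted_prices(closes[1:], factors[1:])  # [5.0, 6.0]

# On the overlapping days the two series differ by a constant ratio.
ratios = [a / b for a, b in zip(source_a[1:], source_b)]  # [2.0, 2.0]
```

Because the ratio is constant across the overlapping days, a single per-stock multiplier is enough to bring source B onto source A's scale.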

Each data source has a dedicated link table, generated as follows:

If the final_a_stock_eod_price already has this stock, adjust_ratio = final_a_stock_eod_price.adjust_price / current_data_source.adjust_price
If the stock is new to final_a_stock_eod_price, adjust_ratio = 1.

Data validation must then verify that the adjust factors agree across data sources, i.e. that the following holds for each one:

data_source_1.adjust_ratio * data_source_1.adjust_price = final_a_stock_eod_price.adjust_price
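The link-table rules and the validation check can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the function names, the per-symbol dictionaries, the tolerance, and the sample prices are all hypothetical. The link table is built from prices on one reference date and checked against a later date.

```python
import math
from typing import Dict, List, Optional


def adjust_ratio(final_price: Optional[float], source_price: float) -> float:
    """adjust_ratio for one stock; 1.0 when the stock is new to the final table."""
    if final_price is None:
        return 1.0
    return final_price / source_price


def build_link_table(final_prices: Dict[str, float],
                     source_prices: Dict[str, float]) -> Dict[str, float]:
    """Map each symbol in the source to its adjust_ratio."""
    return {symbol: adjust_ratio(final_prices.get(symbol), price)
            for symbol, price in source_prices.items()}


def find_mismatches(link: Dict[str, float],
                    final_prices: Dict[str, float],
                    source_prices: Dict[str, float],
                    rel_tol: float = 1e-6) -> List[str]:
    """Symbols where adjust_ratio * source price differs from the final price."""
    bad = []
    for symbol, price in source_prices.items():
        expected = final_prices.get(symbol)
        if expected is None:
            continue  # nothing to compare against yet
        if not math.isclose(link[symbol] * price, expected, rel_tol=rel_tol):
            bad.append(symbol)
    return sorted(bad)


# Made-up adjusted prices on a reference date and on a later date.
link = build_link_table({"AAA": 20.0}, {"AAA": 10.0, "BBB": 7.0})
mismatches = find_mismatches(link,
                             {"AAA": 22.0, "BBB": 7.7},
                             {"AAA": 11.0, "BBB": 7.0})
```

In this made-up data `AAA` validates (2.0 * 11.0 = 22.0), while `BBB` is flagged because its ratio of 1.0 no longer reproduces the final adjusted price.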

Contribution Guide

Add more stock index

Adding a new stock index takes three changes:

Add an index weight download script. Change the tushare/dump_index_eod_price.py script to dump the index info. If the index is not available in Tushare, write a new script and add it to the daily_update.sh script. Example commit
Add a price download script. Change tushare/dump_index_eod_price.py to add the index price. Example commit
Modify the export script. Change the qlib dump script qlib/dump_index_weight.py#L13 so that the index is dumped and renamed to a txt file for use. Example commit

Add more data source or fields

Please raise an issue to discuss the plan first. Example issue: chenditc#11

The issue should answer the following questions:

Why do we want this data?
How do we run regular updates?
- Which data source would we use?
- When should we trigger an update?
- How do we validate that a regular update completed correctly?
From which data source should we get historical data?
How do we plan to validate the historical data?
- Is the data source complete? How did we verify this?
- Is the data source accurate? How did we verify this?
- If validation reports errors, how will we deal with them?
Are we changing an existing table or adding a new table?

If the data is not clean, we may work hard to extract insights from it and end up with incorrect ones. We therefore want high-quality data, not just more data.
