reddit2dialog was created for NLP practitioners who are interested in dialogue models but lack pretraining data.
With reddit2dialog, you can easily download Reddit comments and convert them into dialogue pairs in (context, response) format.
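To make the (context, response) format concrete, here is a minimal sketch of how a comment thread could be turned into dialogue pairs. It is not the code reddit2dialog itself uses: the `build_pairs` function, the `max_turns` parameter, and the simplified `id` / `parent_id` / `body` records are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical, simplified comment record: {"id", "parent_id", "body"}.
Comment = Dict[str, Optional[str]]


def build_pairs(
    comments: List[Comment], max_turns: int = 3
) -> List[Tuple[List[str], str]]:
    """Pair every reply with up to `max_turns` ancestor comments as context."""
    by_id = {c["id"]: c for c in comments}
    pairs = []
    for comment in comments:
        context = []
        parent = by_id.get(comment["parent_id"])
        # Walk up the reply chain, nearest ancestor first.
        while parent is not None and len(context) < max_turns:
            context.append(parent["body"])
            parent = by_id.get(parent["parent_id"])
        if context:
            # Reverse so the context reads oldest to newest.
            pairs.append((context[::-1], comment["body"]))
    return pairs


thread = [
    {"id": "a", "parent_id": None, "body": "Any tips for learning Rust?"},
    {"id": "b", "parent_id": "a", "body": "Start with the official book."},
    {"id": "c", "parent_id": "b", "body": "Thanks, I will!"},
]
pairs = build_pairs(thread)
for context, response in pairs:
    print(context, "->", response)
```

Top-level comments have no parent in this sketch, so they only ever appear as context, never as a response.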
Make sure you have the following dependencies installed.
- python>=3.8
- transformers
- requests
- bs4
- lxml
- msgspec
- zstandard
- tqdm
Install them with:
pip install -r requirements.txt
First, download the Reddit comment data you need. The flags set the start year (-sy), end year (-ey), start month (-sm), end month (-em), and output directory (-o):
python download.py \
-sy 2021 \
-ey 2022 \
-sm 5 \
-em 5 \
-o data_dir/
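As a rough illustration of what a start/end year and month selection covers, the sketch below lists the months between two dates. It assumes the range is inclusive on both ends, which this README does not confirm, and `month_range` is a hypothetical helper, not part of download.py.

```python
from typing import List


def month_range(
    start_year: int, start_month: int, end_year: int, end_month: int
) -> List[str]:
    """List months as 'YYYY-MM' strings, inclusive on both ends (assumption)."""
    months = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        months.append(f"{year}-{month:02d}")
        month += 1
        if month > 12:
            # Roll over into January of the next year.
            year, month = year + 1, 1
    return months


months = month_range(2021, 5, 2022, 5)
print(len(months), months[0], months[-1])
```

Under that assumption, the flags shown above would cover thirteen monthly dumps, so check the available disk space before starting the download.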
Process the data you just downloaded. The date and directory flags have the same meaning as for download.py, and --valid_split_percentage sets the share of the data held out for validation:
python process.py \
-sy 2021 \
-ey 2022 \
-sm 5 \
-em 5 \
-o data_dir/ \
--valid_split_percentage 0.0002
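The effect of a small validation split can be sketched as follows. This is only an illustration: it assumes the value is interpreted as a fraction of all pairs (so 0.0002 means 2 in every 10,000), which the README does not state, and `split_pairs` is a hypothetical helper rather than the logic in process.py.

```python
import random
from typing import List, Sequence, Tuple


def split_pairs(
    pairs: Sequence, valid_fraction: float, seed: int = 0
) -> Tuple[List, List]:
    """Shuffle the pairs and hold out `valid_fraction` of them for validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    # Keep at least one validation example whenever there is any data.
    n_valid = max(1, round(len(shuffled) * valid_fraction)) if shuffled else 0
    return shuffled[n_valid:], shuffled[:n_valid]


# Dummy (context, response) pairs standing in for processed data.
all_pairs = [([f"context {i}"], f"response {i}") for i in range(10_000)]
train, valid = split_pairs(all_pairs, valid_fraction=0.0002)
print(len(train), len(valid))
```

A tiny fraction is usually enough here because Reddit dumps are very large, so even a small share yields a usable validation set.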