Tokenization file missing
Closed this issue · 2 comments
humzaiqbal commented
Hi! I am trying to run tokenization on my processed data with the following command:
python3 ray_processing/tokenize_shuffle.py --source_ref_paths /processed/json \
--readable_name c4_v4 \
--output /path/to/output \
--content_key "text" \
--do_sample
and I get this error:
FileNotFoundError: [Errno 2] No such file or directory: 'tokenization_configs/rpj_lm_data.yaml'
Looking at the tokenization script, it seems there should be a tokenization_configs directory that contains this file. Is it supposed to be available to us? It would be useful to make sure I have the right tokenization configs. Thanks!
GeorgiosSmyrnis commented
Hi @humzaiqbal!
This is a YAML file that defines upsampling / downsampling ratios for your source data, and it is only ever used with the --do_sample flag. If you don't do any upsampling / downsampling, you need neither the flag nor the file.
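As a rough illustration of that rule, here is a minimal Python sketch of a pre-flight check that builds the same command and appends --do_sample only when sampling is requested and the config file is actually present. The helper name build_tokenize_args and the constant SAMPLING_CONFIG are hypothetical and not part of the repository; the script path, flags, and config path are copied from the command and error message above.

```python
from pathlib import Path

# Config path copied from the error message in this thread.
# SAMPLING_CONFIG and build_tokenize_args are hypothetical names for illustration.
SAMPLING_CONFIG = Path("tokenization_configs/rpj_lm_data.yaml")


def build_tokenize_args(do_sample: bool, config_path: Path = SAMPLING_CONFIG) -> list[str]:
    """Build the tokenize_shuffle.py command line from this thread.

    --do_sample is appended only when sampling is requested, and only after
    checking that the sampling config exists, so a missing file is reported
    before any processing starts.
    """
    args = [
        "python3", "ray_processing/tokenize_shuffle.py",
        "--source_ref_paths", "/processed/json",
        "--readable_name", "c4_v4",
        "--output", "/path/to/output",
        "--content_key", "text",
    ]
    if do_sample:
        if not config_path.is_file():
            raise FileNotFoundError(
                f"--do_sample needs a sampling config, but {config_path} was not found"
            )
        args.append("--do_sample")
    return args
```

Calling build_tokenize_args(False) returns the command without the flag, which is the case where the YAML file is not needed at all. The resulting list could then be passed to subprocess.run(args, check=True).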
Hope this helps!
humzaiqbal commented
Gotcha thanks!