mlfoundations/dclm

Tokenization file missing

Hi! I am trying to tokenize my processed data by running the following command:

python3 ray_processing/tokenize_shuffle.py --source_ref_paths /processed/json \
        --readable_name c4_v4 \
        --output /path/to/output \
        --content_key "text" \
        --do_sample

and I get this error:

FileNotFoundError: [Errno 2] No such file or directory: 'tokenization_configs/rpj_lm_data.yaml'

Looking at the tokenization script, it seems like there should be a tokenization_configs directory containing this file. Is it supposed to be available to us? It would be useful for making sure I have the right tokenization configs. Thanks!

Hi @humzaiqbal!

This is a YAML file that defines upsampling / downsampling ratios for your source data, and it is only ever used with the --do_sample flag. If you aren't doing any upsampling / downsampling, you need neither the flag nor the file.
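
The simplest fix is to drop --do_sample from the command above. If you launch tokenization from a wrapper script, a rough sketch like the one below adds the flag only when the sampling config is actually present. Note that build_tokenize_command is a hypothetical helper, not part of the repo, and I'm assuming the config path from the error message is resolved relative to the directory you launch from.

```python
from pathlib import Path

# Path copied from the error message; assumed to be relative to the
# directory the tokenization script is launched from.
SAMPLING_CONFIG = Path("tokenization_configs/rpj_lm_data.yaml")


def build_tokenize_command(source, name, output, content_key="text",
                           config=SAMPLING_CONFIG):
    """Hypothetical helper: assemble the tokenize_shuffle.py command line.

    --do_sample is appended only if the sampling config file exists,
    which avoids the FileNotFoundError reported above.
    """
    cmd = [
        "python3", "ray_processing/tokenize_shuffle.py",
        "--source_ref_paths", source,
        "--readable_name", name,
        "--output", output,
        "--content_key", content_key,
    ]
    if Path(config).is_file():
        cmd.append("--do_sample")
    return cmd


if __name__ == "__main__":
    # Print the command so it can be inspected before running it.
    print(" ".join(build_tokenize_command("/processed/json", "c4_v4",
                                          "/path/to/output")))
```

The returned list can be passed straight to subprocess.run if you want to launch the job from Python.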

Hope this helps!

Gotcha, thanks!