mlfoundations/dclm

What are the pretraining scripts?

Opened this issue · 12 comments

Thank you for your excellent work. If I want to use this data for pretraining and conduct a rigorous comparison with the DCLM-BASELINE 7B model mentioned here, what hyper-parameters should I use? Could you provide the corresponding script? Thank you.

Hi @mathfinder ,

You can find the training configuration files under training/configs. For example, for the 7B-1x scale, the corresponding config with all the hyperparameters is https://github.com/mlfoundations/dclm/blob/main/training/configs/7b_1x_fast_2e-3_lr_5e-6_zloss.json
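
Since the config is plain JSON, you can also dump all of the hyperparameters directly, e.g. with the minimal sketch below (assuming you run it from the root of the repository):

import json
# Print every hyperparameter in the 7B-1x training config.
with open("training/configs/7b_1x_fast_2e-3_lr_5e-6_zloss.json") as f:
    config = json.load(f)
for key, value in sorted(config.items()):
    print(f"{key}: {value}")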

Please let us know if the above helps!

A small follow-up: for the largest model that we trained (beyond the competition scales), we used the same hyperparameters as the 7B-2x scale, along with the cooldown process described in Appendix P of our paper.

Thanks for your reply, but I am still confused.

Given the following pre-training script template:

torchrun --nproc-per-node 8 -m training.train --scale <scale> <tokenized_json> --logs <log_dir> [--remote-sync <s3_bucket>] [--chinchilla-multiplier <multiplier>] [--clean-exp] [--report-to-wandb]

Which file should be used for <tokenized_json> to align with the results reported on the leaderboard?

I guess --data-config should be exp_data/datasets/tokenized/c4_original.json, as in the following script:
torchrun --nproc-per-node 8 -m training.train --scale="7b_2x_fast_2e-3_lr_5e-6_zloss" --data-config="exp_data/datasets/tokenized/c4_original.json" --report-to-wandb

Hi @mathfinder ,

For the DCLM-baseline dataset, you will first need to download DCLM-baseline from here: https://data.commoncrawl.org/contrib/datacomp/DCLM-baseline/index.html and create an untokenized json file similar to the ones found in exp_data/datasets/raw_sources.

After doing so, you should tokenize it using the instructions in this repository. This will produce a new json file under exp_data/datasets/tokenized, which you can then use as tokenized_json for --data-config.
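
As a rough sketch of that first step (the field names below are illustrative placeholders only; please copy the actual schema from one of the existing files under exp_data/datasets/raw_sources):

import json
# Hypothetical example only: the keys here are placeholders, not the repository's real schema.
entry = {
    "name": "dclm_baseline_local",             # placeholder dataset name
    "dataset_url": "/path/to/dclm_baseline/",  # where the downloaded shards live
    "tokenized": False,                        # this entry describes untokenized data
}
with open("exp_data/datasets/raw_sources/dclm_baseline_local.json", "w") as f:
    json.dump(entry, f, indent=2)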

Hi @GeorgiosSmyrnis ,
I have downloaded the dataset with the following code:

import os
import sys
# Enable hf_transfer for faster downloads; set this before huggingface_hub is imported.
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'
from huggingface_hub import snapshot_download
# Optional glob pattern passed as the first CLI argument (e.g. "sample/350BT/008_*")
# to restrict the download to matching files.
pattern = None
if len(sys.argv) > 1:
    pattern = sys.argv[1]
    print(f'{"#"*30} {pattern} {"#"*30}')
snapshot_download(repo_id="mlfoundations/dclm-baseline-1.0",
                  repo_type="dataset",
                  revision="main",
                  allow_patterns=pattern,
                  local_dir="path/to/dclm",
                  local_dir_use_symlinks=False,
                  resume_download=True,
                  max_workers=32)
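
A quick sanity check that the shards actually arrived (the .jsonl.zst extension here is my assumption, so adjust the glob to whatever the files in the repo actually use):

from pathlib import Path
# Count the downloaded shard files under the local download directory.
n_shards = sum(1 for _ in Path("path/to/dclm").rglob("*.jsonl.zst"))
print(f"downloaded {n_shards} shard files")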

And then I use the following script to tokenize the datasets:

python ray_processing/tokenize_shuffle.py \
--input /path/to/untokenized_data \
--readable_name dclm \
--output /path/to/tokenized_data \
--content_key text

If I turn on do_sample, the code looks for the non-existent file DCLM/ray_processing/tokenization_configs/rpj_lm_data.yaml. Do I need to turn it on?

My aim is to reproduce the highlighted experiment below.

[screenshot of the highlighted experiment]

And by the way, could you please provide the tokenized and shuffled dataset so that we can directly reproduce the experiment?

Hi @mathfinder !

  • For the highlighted experiment, you don't need to do upsampling / downsampling of sources, so you don't need the --do_sample parameter or the associated yaml file - you can safely ignore this.
  • I will check in with the rest of the team regarding the tokenized dataset - given the size of these datasets there are some considerations regarding hosting multiple versions of the data.

Having multiple versions of the data available, especially the sampled ones, would be very helpful for reproduction!

Oh, that sounds amazing! I'm really looking forward to seeing your progress.

It seems that ray_processing/tokenize_shuffle.py heavily depends on S3, so using a local dataset will require a lot of changes.

Is it necessary to keep shuffling enabled (i.e., not pass --no_shuffle)? If I use Spark to tokenize, can I reproduce your experiment without replicating your exact shuffling process?

Hi @mathfinder ,

This script should work on local datasets as well, as long as you spin up a Ray cluster locally. Are you encountering any specific errors when trying to do so?
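
In case it helps, this is generic Ray usage rather than anything DCLM-specific: a single-node cluster can usually be started with ray start --head before launching the script, and the minimal sketch below just confirms that Ray itself runs on your machine (the repository's setup instructions take precedence if they differ):

import ray
# Start (or connect to) a local single-node Ray instance, print its resources as a
# quick sanity check, then shut it down.
ray.init()
print(ray.cluster_resources())
ray.shutdown()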