allenai/scibert

Pre-training parameters

Closed this issue · 5 comments

Hi,

I'm currently training a BERT model from scratch using the same parameters as specified in scripts/cheatsheet.txt.

@ibeltagy Could you confirm that these parameters are up-to-date 🤔

Loss seems to be fine, but I'm wondering why training both the 128 and 512 seq len models on 3B tokens is a lot faster on a v3-8 TPU than your reported training time.
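For what it's worth, the raw token throughput implied by the flags can be estimated directly; this is a rough sketch based on the 128-seq-len command later in the thread (batch 256, seq len 128, 500k steps, and counting padding tokens), not an official number from the repo:

```python
def tokens_seen(batch_size: int, seq_len: int, steps: int) -> int:
    """Total tokens pushed through the model (including padding)."""
    return batch_size * seq_len * steps

# 128-seq-len phase from the command below: 256 * 128 * 500,000
phase_128 = tokens_seen(256, 128, 500_000)
print(f"128-phase tokens: {phase_128:,}")  # 16,384,000,000
```

On a ~3B-token corpus that works out to roughly five passes over the data, so wall-clock differences between runs mostly come down to input-pipeline and TPU utilization rather than corpus size.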

Yes, these are the same parameters I used for training. It is possible that our corpus is more difficult because of the PDF-parsing noise. I can find the learning curves for you if you think that would be useful.

Hi @ibeltagy thanks for your reply :) I've another question. The initial pre-training starts with a sequence length of 128:

python3 run_pretraining.py \
--input_file=gs://s2-bert/s2-tfRecords/tfRecords_s2vocab_uncased_128/*.tfrecord \
--output_dir=gs://s2-bert/s2-models/3B-s2vocab_uncased_128  \
--do_train=True --do_eval=True \
--bert_config_file=/mnt/disk1/bert_config/s2vocab_uncased.json \
--train_batch_size=256 --max_seq_length=128 \
--max_predictions_per_seq=20 --num_train_steps=500000 \
--num_warmup_steps=1000 --learning_rate=1e-4 --use_tpu=True \
--tpu_name=node-3 --max_eval_steps=2000 --eval_batch_size=256 \
--init_checkpoint=gs://s2-bert/s2-models/3B-s2vocab_uncased_128 \
--tpu_zone=us-central1-a

Can you explain where the init_checkpoint comes from? It is actually the same path as the output_dir 🤔

It is a randomly initialized checkpoint. I run run_pretraining.py once without --init_checkpoint so that the script generates a randomly initialized model and saves it to --output_dir, kill the script, and then run it again with --init_checkpoint=output_dir.
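The two-step trick above can be sketched as follows; the paths are taken from the command earlier in the thread, and the elided flags are the same in both runs:

```shell
# Step 1: launch WITHOUT --init_checkpoint. run_pretraining.py writes a
# randomly initialized checkpoint to --output_dir; kill the process once
# the first checkpoint files appear in the bucket.
python3 run_pretraining.py \
  --output_dir=gs://s2-bert/s2-models/3B-s2vocab_uncased_128 \
  ...  # remaining flags as in the command above, minus --init_checkpoint

# Step 2: relaunch with --init_checkpoint pointing at that same directory,
# so training resumes from the random checkpoint written in step 1.
python3 run_pretraining.py \
  --init_checkpoint=gs://s2-bert/s2-models/3B-s2vocab_uncased_128 \
  --output_dir=gs://s2-bert/s2-models/3B-s2vocab_uncased_128 \
  ...  # remaining flags as in the command above
```

This is just a way to satisfy the script's expectation of an existing checkpoint; the weights in it carry no pre-trained information.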

Thanks Iz ❤️ Just one last question on the pre-training topic: what was the number of tfrecords (and the corresponding text size per shard) 🤔

250 tfrecords, each file is 800-900MB (around 4000 papers)
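A quick back-of-envelope from those shard numbers (the 850 MB midpoint is my own estimate, not a figure from the thread):

```python
# 250 tfrecord shards, ~800-900 MB each, ~4,000 papers per shard
num_shards = 250
mb_per_shard = (800 + 900) / 2   # midpoint of the reported 800-900 MB range
papers_per_shard = 4000

total_gb = num_shards * mb_per_shard / 1024
total_papers = num_shards * papers_per_shard
print(f"~{total_gb:.0f} GB of tfrecords, ~{total_papers:,} papers")
```

So the pre-training corpus is on the order of 200 GB of serialized examples covering roughly a million papers.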