Pre-training parameters
Hi,
I'm currently training a BERT model from scratch using the same parameters as specified in scripts/cheatsheet.txt.
@ibeltagy Could you confirm that these parameters are up-to-date? 🤔
Loss seems to be fine, but I'm just wondering why training both the 128 and 512 seq len models on 3B tokens with a v3-8 TPU is a lot faster than your reported training time.
Yes, these are the same parameters I used for training. It is possible that our corpus is more difficult because of the PDF parsing noise. I can find the learning curves for you if you think that would be useful.
Hi @ibeltagy, thanks for your reply :) I have another question. The initial pre-training starts with a sequence length of 128:
python3 run_pretraining.py \
--input_file=gs://s2-bert/s2-tfRecords/tfRecords_s2vocab_uncased_128/*.tfrecord \
--output_dir=gs://s2-bert/s2-models/3B-s2vocab_uncased_128 \
--do_train=True --do_eval=True \
--bert_config_file=/mnt/disk1/bert_config/s2vocab_uncased.json \
--train_batch_size=256 --max_seq_length=128 \
--max_predictions_per_seq=20 --num_train_steps=500000 \
--num_warmup_steps=1000 --learning_rate=1e-4 --use_tpu=True \
--tpu_name=node-3 --max_eval_steps=2000 --eval_batch_size 256 \
--init_checkpoint=gs://s2-bert/s2-models/3B-s2vocab_uncased_128 \
--tpu_zone=us-central1-a
Can you explain where the init_checkpoint comes from (because it is actually the same path as used for the output_dir)? 🤔
It is a randomly initialized checkpoint. I run run_pretraining.py without --init_checkpoint so that the script generates a randomly initialized model and saves it to --output_dir, kill the script, and then run it again with --init_checkpoint=output_dir.
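For concreteness, here is a minimal sketch of that two-step bootstrap, assuming the flags from the 128-seq-len command above; the bucket path, config path, and TPU name below are placeholders, not the actual values used.

BUCKET=gs://my-bucket   # placeholder: replace with your GCS bucket
TPU=my-tpu-node         # placeholder: replace with your TPU node name

# Flags shared by both runs (mirrors the command quoted above).
FLAGS=(
  --input_file="${BUCKET}/tfRecords_128/*.tfrecord"
  --output_dir="${BUCKET}/model_128"
  --do_train=True --do_eval=True
  --bert_config_file=/path/to/s2vocab_uncased.json
  --train_batch_size=256 --max_seq_length=128
  --max_predictions_per_seq=20 --num_train_steps=500000
  --num_warmup_steps=1000 --learning_rate=1e-4
  --max_eval_steps=2000 --eval_batch_size=256
  --use_tpu=True --tpu_name="${TPU}" --tpu_zone=us-central1-a
)

# Step 1: run WITHOUT --init_checkpoint; the script writes a randomly
# initialized checkpoint into --output_dir. Kill it (Ctrl+C) once the first
# checkpoint files appear in that directory.
python3 run_pretraining.py "${FLAGS[@]}"

# Step 2: restart the same command, now pointing --init_checkpoint at the
# directory holding the checkpoint from step 1 (the same path as --output_dir).
python3 run_pretraining.py "${FLAGS[@]}" --init_checkpoint="${BUCKET}/model_128"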
Thanks Iz ❤️ Just one last question on the pre-training topic: what was the number of tfrecords (and the corresponding text size per shard)? 🤔
250 tfrecords, each file is 800-900MB (around 4000 papers)
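In case it helps anyone reproducing this setup, here is a small sketch for sanity-checking the shard count and sizes against those numbers; the bucket path is a placeholder and it assumes gsutil is installed.

# Count the shards (expect around 250 files).
gsutil ls 'gs://my-bucket/tfRecords_128/*.tfrecord' | wc -l

# Per-shard sizes (expect roughly 800-900MB each), then the total.
gsutil du -h 'gs://my-bucket/tfRecords_128'
gsutil du -sh 'gs://my-bucket/tfRecords_128'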