allenai/scibert

Pre-training parameters

Closed this issue · 5 comments

Hi,

I'm currently training a BERT model from scratch using the same parameters as specified in scripts/cheatsheet.txt.

@ibeltagy Could you confirm that these parameters are up-to-date 🤔

Loss seems to be fine, but I'm wondering why training both the 128 and 512 seq len models on 3B tokens is a lot faster on a v3-8 TPU than your reported training time.
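For what it's worth, the raw token throughput implied by the flags can be estimated directly; this is a rough sketch based on the 128-seq-len command later in the thread (batch 256, seq len 128, 500k steps, and counting padding tokens), not an official number from the repo:

```python
def tokens_seen(batch_size: int, seq_len: int, steps: int) -> int:
    """Total tokens pushed through the model (including padding)."""
    return batch_size * seq_len * steps

# 128-seq-len phase from the command below: 256 * 128 * 500,000
phase_128 = tokens_seen(256, 128, 500_000)
print(f"128-phase tokens: {phase_128:,}")  # 16,384,000,000
```

On a ~3B-token corpus that works out to roughly five passes over the data, so wall-clock differences between runs mostly come down to input-pipeline and TPU utilization rather than corpus size.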

Yes, these are the same parameters I used for training. It is possible that our corpus is more difficult because of the PDF-parsing noise. I can find the learning curves for you if you think that would be useful.

Hi @ibeltagy thanks for your reply :) I've another question. The initial pre-training starts with a sequence length of 128:

python3 run_pretraining.py \
--input_file=gs://s2-bert/s2-tfRecords/tfRecords_s2vocab_uncased_128/*.tfrecord \
--output_dir=gs://s2-bert/s2-models/3B-s2vocab_uncased_128  \
--do_train=True --do_eval=True \
--bert_config_file=/mnt/disk1/bert_config/s2vocab_uncased.json \
--train_batch_size=256 --max_seq_length=128 \
--max_predictions_per_seq=20 --num_train_steps=500000 \
--num_warmup_steps=1000 --learning_rate=1e-4 --use_tpu=True \
--tpu_name=node-3 --max_eval_steps=2000 --eval_batch_size=256 \
--init_checkpoint=gs://s2-bert/s2-models/3B-s2vocab_uncased_128 \
--tpu_zone=us-central1-a

Can you explain where the init_checkpoint comes from? It is actually the same path as the output_dir 🤔

It is a randomly initialized checkpoint. I run run_pretraining.py once without --init_checkpoint so that the script generates a randomly initialized model and saves it to --output_dir, kill the script, and then run it again with --init_checkpoint=output_dir.
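The two-step trick above can be sketched as follows; the paths are taken from the command earlier in the thread, and the elided flags are the same in both runs:

```shell
# Step 1: launch WITHOUT --init_checkpoint. run_pretraining.py writes a
# randomly initialized checkpoint to --output_dir; kill the process once
# the first checkpoint files appear in the bucket.
python3 run_pretraining.py \
  --output_dir=gs://s2-bert/s2-models/3B-s2vocab_uncased_128 \
  ...  # remaining flags as in the command above, minus --init_checkpoint

# Step 2: relaunch with --init_checkpoint pointing at that same directory,
# so training resumes from the random checkpoint written in step 1.
python3 run_pretraining.py \
  --init_checkpoint=gs://s2-bert/s2-models/3B-s2vocab_uncased_128 \
  --output_dir=gs://s2-bert/s2-models/3B-s2vocab_uncased_128 \
  ...  # remaining flags as in the command above
```

This is just a way to satisfy the script's expectation of an existing checkpoint; the weights in it carry no pre-trained information.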

Thanks Iz ❤️ Just one last question on the pre-training topic: what was the number of tfrecords (and the corresponding text size per shard) 🤔

250 tfrecords, each file is 800-900MB (around 4000 papers)
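A quick back-of-envelope from those shard numbers (the 850 MB midpoint is my own estimate, not a figure from the thread):

```python
# 250 tfrecord shards, ~800-900 MB each, ~4,000 papers per shard
num_shards = 250
mb_per_shard = (800 + 900) / 2   # midpoint of the reported 800-900 MB range
papers_per_shard = 4000

total_gb = num_shards * mb_per_shard / 1024
total_papers = num_shards * papers_per_shard
print(f"~{total_gb:.0f} GB of tfrecords, ~{total_papers:,} papers")
```

So the pre-training corpus is on the order of 200 GB of serialized examples covering roughly a million papers.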