sangmichaelxie/doremi

Step 1 baseline_280M loss is large

gawei1995 opened this issue · 5 comments

The 280M baseline model's loss is hovering around 5, with all training hyperparameters left at their default values. The file sampler in the preprocessing script is set to 100,000.

I also found something that looks like a bug: the trainer reports 40M trainable parameters, not 280M. Why?


The loss value around 5 is expected, and things should be working normally for you. The number of trainable parameters being smaller is due to sharding across the GPUs. If you try with 1 GPU it should say the full number.
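
For illustration, here's a minimal sketch of how sharding shrinks the per-rank parameter count, assuming PyTorch FSDP as the sharding mechanism (the repo's training setup may shard differently); run it with `torchrun --nproc_per_node=2`:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import GPT2Config, GPT2LMHeadModel

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

# randomly initialized GPT-2 at its default size (~124M parameters)
model = GPT2LMHeadModel(GPT2Config())
full = sum(p.numel() for p in model.parameters())

# after wrapping, each rank holds only its shard of the flattened
# parameters, so the same count comes out roughly full / world_size
sharded = FSDP(model.cuda())
shard = sum(p.numel() for p in sharded.parameters())

if dist.get_rank() == 0:
    print(f"full: {full / 1e6:.0f}M, per-rank after FSDP: {shard / 1e6:.0f}M")

dist.destroy_process_group()
```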

Thank you for replying. But when I use the official GPT-2 implementation instead of your gpt2fast version, the loss reaches 2.4, so I think it's a model architecture issue.
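
For reference, a rough sketch of the swap I mean, assuming the stock `transformers` GPT-2 stands in for the official version; the config values below are illustrative guesses at a roughly 280M model, not the repo's actual config:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# stock (non-flash) GPT-2, randomly initialized for from-scratch training;
# these sizes are illustrative, not the repo's actual 280M hyperparameters
config = GPT2Config(n_embd=1024, n_layer=20, n_head=16)
model = GPT2LMHeadModel(config)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # full count on a single process
```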

[screenshot: step 1 loss when using the official GPT-2 version]

Actually, you're right, thanks for pointing this out. I've replaced the model in the current HEAD; you'll need to run `bash scripts/setup_flash.sh` before running `run_pile_baseline280M.sh` again. In preliminary tests the loss goes much lower than before.