step 1 baseline_280M loss large
gawei1995 opened this issue · 5 comments
The loss value around 5 is expected, and things should be working normally for you. The number of trainable parameters being smaller is due to sharding across the GPUs. If you try with 1 GPU it should say the full number.
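A quick way to see the sharding effect (a minimal sketch with a toy module; the FSDP/ZeRO-style sharding mentioned in the comments is an assumption about how the training script distributes the model, not code from this repo):

```python
import torch.nn as nn

# Toy stand-in for the 280M model.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# On a single GPU (or CPU) this counts the full set of trainable parameters.
full = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {full:,}")

# If the training script shards the model across ranks (e.g. torch FSDP or
# ZeRO-style sharding), each rank only materializes its own flat shard, so
# the same count taken inside the training loop comes out to roughly
# full / world_size -- with 8 GPUs, about full // 8 per rank. That is why
# the reported number looks smaller than the full model size.
```

Running the real training script on 1 GPU, as suggested above, should print the full ~280M figure.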
Thank you for replying, but when I use the official GPT-2 implementation instead of your gpt2fast version, the loss reaches 2.4, so I think it's a model architecture issue.
Actually, you're right, thanks for pointing this out. I've replaced the model in the current HEAD; you'll need to run `bash scripts/setup_flash.sh` before running `run_pile_baseline280M.sh` again. In prelim tests the loss goes much lower than before.