step 1 baseline_280M loss large
gawei1995 opened this issue · 5 comments
The loss value around 5 is expected, and things should be working normally for you. The number of trainable parameters being smaller is due to sharding across the GPUs. If you try with 1 GPU it should say the full number.
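A quick way to see the sharding effect (a minimal sketch with a toy module; the FSDP/ZeRO-style sharding mentioned in the comments is an assumption about how the training script distributes the model, not code from this repo):

```python
import torch.nn as nn

# Toy stand-in for the 280M model.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# On a single GPU (or CPU) this counts the full set of trainable parameters.
full = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {full:,}")

# If the training script shards the model across ranks (e.g. torch FSDP or
# ZeRO-style sharding), each rank only materializes its own flat shard, so
# the same count taken inside the training loop comes out to roughly
# full / world_size -- with 8 GPUs, about full // 8 per rank. That is why
# the reported number looks smaller than the full model size.
```

Running the real training script on 1 GPU, as suggested above, should print the full ~280M figure.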
Thank you for replying, but when I use the official GPT-2 implementation instead of your gpt2fast version, the loss reaches 2.4, so I think it's a model architecture issue.
Actually, you're right, thanks for pointing this out. I've replaced the model in the current HEAD; you'll need to run `bash scripts/setup_flash.sh` before running `run_pile_baseline280M.sh` again. In prelim tests the loss goes much lower than before.