Nvidia H100 - Fine-tune Dolly Failure - "Loss" and "learning rate" start at 0 and remain 0
GeorgiAngelov opened this issue · 1 comment
GeorgiAngelov commented
I have successfully trained the model on my fine-tuning dataset on an A6000 GPU. However, when I run the same training on an H100, the loss and learning rate are reported as 0 for the entire run:
Command: deepspeed --num_gpus=1 --module training.trainer --deepspeed config/a100_config.json --local-output-dir testOutputDir --per-device-train-batch-size 2 --input-model databricks/dolly-v2-3b --logging-steps 10 --save-steps 200 --save-total-limit 20 --eval-steps 50 --warmup-steps 50 --test-size 10 --lr 5e-6 --epochs 2 --training-dataset data/testData.json --bf16 true
```
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.24}
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.49}
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.73}
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.98}
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 1.22}
{'eval_loss': 9.275385111957112e-39, 'eval_runtime': 1.5501, 'eval_samples_per_second': 6.451, 'eval_steps_per_second': 1.29, 'epoch': 1.22}
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 1.46}
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 1.71}
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 1.95}
{'train_runtime': 135.9952, 'train_samples_per_second': 1.206, 'train_steps_per_second': 0.603, 'train_loss': 0.0, 'epoch': 2.0}```
As I mentioned, the training works great on the A6000 (using the V100 config).
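For anyone comparing runs across GPUs, here is a minimal sketch for checking a captured log for this symptom. It assumes the metric lines are printed as Python dict literals, as in the output above; the helper names `parse_metrics` and `stuck_at_zero` are hypothetical and not part of the Dolly training code.

```python
import ast

def parse_metrics(log_text):
    """Collect the dict-shaped metric lines from a training log, skipping warnings."""
    records = []
    for line in log_text.splitlines():
        # Drop surrounding whitespace and any stray code-fence backticks.
        line = line.strip().rstrip("`")
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            records.append(ast.literal_eval(line))
        except (ValueError, SyntaxError):
            continue  # not a plain dict literal, ignore it
    return records

def stuck_at_zero(records):
    """True if every logged training step reports loss 0 and learning_rate 0."""
    steps = [r for r in records if "loss" in r and "learning_rate" in r]
    return bool(steps) and all(
        r["loss"] == 0 and r["learning_rate"] == 0 for r in steps
    )

# Sample lines copied from the log in this issue.
sample = """\
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.24}
{'eval_loss': 9.275385111957112e-39, 'eval_runtime': 1.5501, 'eval_samples_per_second': 6.451, 'eval_steps_per_second': 1.29, 'epoch': 1.22}
{'train_runtime': 135.9952, 'train_samples_per_second': 1.206, 'train_steps_per_second': 0.603, 'train_loss': 0.0, 'epoch': 2.0}```
"""

records = parse_metrics(sample)
print(len(records), stuck_at_zero(records))
```

This only detects the symptom. It says nothing about the cause, so it is mainly useful for confirming that a changed config or flag made a difference between two runs.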
srowen commented
Weird, never seen that one. Did you modify the code or config at all?