xuyuzhuang11/OneBit

About GPU OOM

Closed this issue · 3 comments

Hi,

I ran "llama2_7b.sh" following your steps on a server with 3 available A100 (80 GB) GPUs, but found that with your default DeepSpeed option --per_device_train_batch_size 4 the GPUs go OOM; the maximum I can set is --per_device_train_batch_size 3. I wonder if this is the expected behavior?

Thanks

3 A100/80GB GPUs may be relatively low-resource for the knowledge distillation process. A batch size of 3 is probably fine (as long as there is no OOM), but I cannot say for certain. :-)

Thank you!

Yes, with --per_device_train_batch_size 3, 3 A100/80GB GPUs seem OK; GPU RAM usage goes up to 80406/81920 MiB.
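For anyone hitting the same OOM, a minimal sketch of the adjusted launch follows. Only --per_device_train_batch_size is confirmed by this thread; the deepspeed launcher invocation, script name, and the use of --gradient_accumulation_steps (a standard HF Trainer flag for preserving the effective batch size) are assumptions about what llama2_7b.sh contains:

```shell
# Sketch: lowering the per-GPU batch size to fit 3x A100/80GB.
# Assumed launcher call and script name; only the batch-size flag
# is confirmed by this thread.
deepspeed --num_gpus=3 train.py \
  --per_device_train_batch_size 3 \
  --gradient_accumulation_steps 4  # optional: raise to recover the effective batch size
```

With accumulation over 4 steps, the effective batch size becomes 3 GPUs x 3 per device x 4 = 36 samples per optimizer step, at the cost of slower training.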