Out-of-memory error in the code while running on a GPU

Question

Closed this issue 3 months ago · 3 comments

I kept the batch size at 1, but after a few iterations the code runs out of memory on the GPU.

Answer 1 · 2024-04-17T09:27:38.000Z

For training, an 80GB GPU such as an A100 or H100 is needed. Compiling the model with torch.compile can reduce the memory footprint significantly.

Answer 2 · 2024-04-17T12:19:53.000Z

I am using an A10 machine with 8 GPUs of 24GB each. Can we use Accelerate with DeepSpeed to run this code on these GPUs?

Answer 3 · 2024-04-18T11:37:18.000Z

Perhaps you could do this with DeepSpeed ZeRO stage 3, but I wasn't able to integrate DeepSpeed into this repository.
You would be much better off with a single A100 80GB than with 8x A10 24GB GPUs. That also holds for cost if you are renting cloud compute rather than using a physical server.
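
To get a rough feel for why DeepSpeed ZeRO stage 3 might make 24GB cards workable, here is a back-of-the-envelope sketch in Python. It assumes mixed-precision training with Adam, which needs about 16 bytes of model state per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for the fp32 optimizer states), and that ZeRO stage 3 splits that state evenly across the GPUs. The parameter count and the 20% headroom reserved for activations are made-up placeholders, not numbers from this repository, and the function names are only for illustration.

```python
# Rough estimate of per-GPU memory needed for model states only.
# Assumption: mixed-precision Adam, 16 bytes per parameter
#   2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 weights copy, momentum, variance)
# Activations, temporary buffers and fragmentation are NOT included.

BYTES_PER_PARAM = 16
GIB = 1024 ** 3


def model_state_gib(num_params: int, num_gpus: int = 1, zero_stage3: bool = False) -> float:
    """Return the model-state memory per GPU in GiB."""
    total_bytes = num_params * BYTES_PER_PARAM
    if zero_stage3:
        # ZeRO stage 3 partitions weights, gradients and optimizer states
        # across all data-parallel GPUs.
        total_bytes /= num_gpus
    return total_bytes / GIB


def fits(num_params: int, gpu_gib: float, num_gpus: int = 1,
         zero_stage3: bool = False, headroom: float = 0.2) -> bool:
    """Check whether model states fit, keeping `headroom` free for activations.

    The default headroom fraction is an arbitrary assumption.
    """
    budget = gpu_gib * (1.0 - headroom)
    return model_state_gib(num_params, num_gpus, zero_stage3) <= budget


if __name__ == "__main__":
    hypothetical_params = 3_000_000_000  # placeholder, not this repo's model size

    for label, gpu_gib, n, z3 in [
        ("1x A10 24GB", 24, 1, False),
        ("8x A10 24GB + ZeRO-3", 24, 8, True),
        ("1x A100 80GB", 80, 1, False),
    ]:
        need = model_state_gib(hypothetical_params, n, z3)
        print(f"{label}: {need:.1f} GiB per GPU, fits={fits(hypothetical_params, gpu_gib, n, z3)}")
```

This only counts model states. ZeRO stage 3 also adds communication between the GPUs on every step, which is part of why a single larger GPU is usually the simpler option.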