Out of memory error in the code while running on a gpu
Closed this issue · 3 comments
karthik1coder commented
I kept the batch size at 1, and after a few iterations there is an out-of-memory error when running on the GPU.
danbochman commented
For training, an 80GB GPU such as an A100 or H100 is needed. Compiling the model with torch.compile can reduce the memory footprint significantly.
karthik1coder commented
I am using an A10 machine with 8 GPUs of 24GB each. Can we use the DeepSpeed accelerator to run this code on the GPU?
danbochman commented
Perhaps with DeepSpeed ZeRO stage 3 you might be able to do this, but I wasn't able to integrate DeepSpeed into this repository.
A single A100 80GB will serve you much better than 8x A10 24GB GPUs, also in terms of cost if you're renting cloud compute rather than running a physical server.
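For anyone who wants to try the ZeRO stage 3 route anyway, a minimal DeepSpeed config might look like the sketch below. This is untested with this repository; the file name `ds_config.json` and the optimizer/parameter offload settings are assumptions, and batch size and precision should be adjusted to the actual model.

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  },
  "bf16": { "enabled": true }
}
```

With a training script that accepts DeepSpeed arguments, it would typically be launched with the DeepSpeed launcher, e.g. `deepspeed train.py --deepspeed_config ds_config.json` (the script name `train.py` is a placeholder).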