fashn-AI/tryondiffusion

Out of memory error in the code while running on a gpu

Closed this issue · 3 comments

I kept the batch size at 1, and after a few iterations I get an out-of-memory error when running on a GPU.

For training, an 80GB GPU such as an A100 or H100 is needed. Compiling the model with torch.compile can reduce the memory footprint significantly.
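As a rough illustration of the torch.compile suggestion, here is a minimal sketch. `TinyNet`, `build_compiled_model`, and `train_step` are hypothetical stand-ins and not names from this repository; the real model and training loop would take their place, and the actual memory saving depends on the model and the PyTorch version.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical placeholder for the real network (assumption, not the repo's model)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, 32), nn.SiLU(), nn.Linear(32, 16))

    def forward(self, x):
        return self.net(x)


def build_compiled_model(device):
    model = TinyNet().to(device)
    # torch.compile returns a wrapper module; the actual compilation
    # happens lazily on the first forward call.
    return torch.compile(model)


def train_step(model, optimizer, batch):
    """One optimization step with a dummy loss, for illustration only."""
    optimizer.zero_grad(set_to_none=True)
    loss = model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()


device = "cuda" if torch.cuda.is_available() else "cpu"
compiled_model = build_compiled_model(device)
optimizer = torch.optim.AdamW(compiled_model.parameters(), lr=1e-4)
```

To check whether compilation helps on a given GPU, one option is to call `torch.cuda.reset_peak_memory_stats()` before a few calls to `train_step` and compare `torch.cuda.max_memory_allocated()` with and without the `torch.compile` wrapper.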

I am using a machine with 8 A10 GPUs of 24GB each. Can we use Accelerate with DeepSpeed to run this code on these GPUs?

Perhaps with DeepSpeed ZeRO stage 3 you might be able to do this, but I wasn't able to integrate DeepSpeed into this repository.
It would be much better for you to use a single A100 80GB than 8x A10 24GB GPUs, including in terms of cost if you are renting cloud compute rather than using a physical server.
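To give a feel for why stage 3 only "might" work, here is a back-of-the-envelope sketch of per-GPU model-state memory under the ZeRO stages. It assumes mixed-precision training with Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states); the function name and the parameter count are hypothetical, and this repository's actual precision, optimizer, and model size may differ.

```python
def model_state_gib(num_params, num_gpus, zero_stage):
    """Approximate per-GPU memory (GiB) for weights, gradients and optimizer states.

    Assumes mixed-precision Adam. Activations are NOT included; ZeRO does not
    partition them, so they still have to fit on each GPU.
    """
    weights = 2 * num_params   # fp16 weights
    grads = 2 * num_params     # fp16 gradients
    optim = 12 * num_params    # fp32 master weights, momentum, variance

    if zero_stage >= 1:
        optim /= num_gpus      # stage 1 partitions optimizer states
    if zero_stage >= 2:
        grads /= num_gpus      # stage 2 also partitions gradients
    if zero_stage >= 3:
        weights /= num_gpus    # stage 3 also partitions the weights

    return (weights + grads + optim) / 2**30


# Hypothetical model size, chosen only to make the arithmetic easy to follow.
hypothetical_params = 2**30

for stage in (0, 1, 2, 3):
    gib = model_state_gib(hypothetical_params, num_gpus=8, zero_stage=stage)
    print(f"ZeRO stage {stage}: {gib:.2f} GiB of model states per GPU")
```

Under these assumptions the model states shrink by roughly the number of GPUs at stage 3, but activation memory per GPU is unchanged, which is often what runs out at batch size 1 on a 24GB card.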