yashkant/spad

About CUDA Memory Usage During Training


Hi, thanks for your excellent work.

I'm curious about how much CUDA memory is used when the model is trained on images at 256x256 resolution (as mentioned in the paper). Would it be possible to train successfully at a higher resolution, such as 512x512?

hi, thanks for checking out spad!

we used H100 80GB cards to train our models, and were able to fit a batch of 10 samples (40 images) when training the 4-view model, and a batch of 36 samples (72 images) when training the 2-view model on one GPU. also, we had to keep the attention layers in full precision during training to prevent divergence (nan losses).
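
for reference, here is a minimal sketch (not the actual spad code) of one way to keep attention in full precision while the rest of the network runs under mixed precision, assuming a standard pytorch training loop with `torch.autocast`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FP32SelfAttention(nn.Module):
    """self-attention that always runs in float32, even inside an autocast region."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # locally disable autocast so q/k/v, softmax, and the output projection stay in fp32
        with torch.autocast(device_type=x.device.type, enabled=False):
            x32 = x.float()
            b, n, d = x32.shape
            q, k, v = self.qkv(x32).chunk(3, dim=-1)
            # [b, n, d] -> [b, heads, n, d_head]
            q, k, v = (t.view(b, n, self.num_heads, -1).transpose(1, 2) for t in (q, k, v))
            out = F.scaled_dot_product_attention(q, k, v)
            out = out.transpose(1, 2).reshape(b, n, d)
            out = self.proj(out)
        # cast back so the surrounding (possibly fp16/bf16) layers are unaffected
        return out.to(x.dtype)
```

the rest of the forward pass can stay inside the usual autocast context; only these layers opt out of it.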

increasing the number of views during training leads to a sharper decrease in the number of images per batch, because of the quadratic complexity of self-attention and the unoptimized implementation of epipolar attention in xformers (based on masking, here). i believe it is possible to optimize it further; a rough sketch of the masking idea is below.
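
to make the masking idea concrete, here is a sketch in plain pytorch (not the xformers-based implementation in the repo); `epipolar_mask` is a hypothetical precomputed boolean mask that is True where a key token lies near the epipolar line of the query token. the full n_tokens x n_tokens mask is exactly why memory still grows quadratically with views * h * w:

```python
import torch
import torch.nn.functional as F

def masked_epipolar_attention(q, k, v, epipolar_mask):
    """cross-view attention restricted by a precomputed epipolar mask.

    q, k, v:        [batch, heads, n_tokens, d_head], n_tokens = views * h * w
    epipolar_mask:  [batch, 1, n_tokens, n_tokens] bool, True = attend, False = block
    """
    # additive bias: 0 where attention is allowed, -inf where it is masked out
    bias = torch.zeros_like(epipolar_mask, dtype=q.dtype)
    bias.masked_fill_(~epipolar_mask, float("-inf"))
    # the bias itself is O(n_tokens^2), so doubling the views quadruples this term --
    # the reason fewer samples fit per batch when training the 4-view model
    return F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
```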

i believe it would be feasible to train a 2-view model at 512x512 with SD1/2. also, i think stable-cascade may alleviate this issue since it has a smaller latent space, though i have not tried it myself.

hope this helps!

Thanks for your reply, it's very helpful.