HeliXonProtein/OmegaFold

Out of memory when training 1 layer of GeometricAttention

Zhang038 opened this issue · 0 comments

Hi,
I am curious about the high memory cost of `class GeometricAttention(OFModule)`. I split the layer out on its own and ran a backward pass through it, and it used far more memory than I expected: roughly 30–40 GB for a single target with sequence length 256. So I was wondering how you manage to train the full model, which stacks 50 Geoformer blocks, given such a high GPU memory cost.
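For reference, here is roughly how I measured it. This is a minimal sketch, not the exact OmegaFold API: the layer construction and the input shape in the usage comment are placeholders that would need to be adapted to the actual `GeometricAttention` signature.

```python
import torch

# Hypothetical harness (not from the OmegaFold repo): measure the peak GPU
# memory of one forward + backward pass through an isolated layer.
def peak_backward_memory_gib(layer: torch.nn.Module, *inputs: torch.Tensor) -> float:
    torch.cuda.reset_peak_memory_stats()
    out = layer(*inputs)
    # Reduce to a scalar so backward() works regardless of the output shape.
    loss = out.sum() if torch.is_tensor(out) else sum(o.sum() for o in out)
    loss.backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**30

# Example usage (placeholder construction and pair-representation shape;
# replace with the real GeometricAttention config and inputs):
# layer = GeometricAttention(cfg).cuda()
# pair = torch.randn(256, 256, 128, device="cuda", requires_grad=True)
# print(f"peak memory: {peak_backward_memory_gib(layer, pair):.1f} GiB")
```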

Thanks a lot in advance!