[BUG?] Higher "gradient_accumulation_steps" still increases memory usage a lot
exnx opened this issue · 3 comments
Hello, I am seeing a large increase in GPU memory usage as I raise gradient_accumulation_steps. For example, I can fit a desired sequence length with gradient_accumulation_steps at 1 and 4, but at 8 I get an out-of-memory error.
I am using 64 GPUs (16 nodes). Here's the memory usage as I increase gradient_accumulation_steps:
grad accum -> memory
1 -> 54 GB
2 -> 60.6 GB
4 -> 70.4 GB
8 -> OOM
My understanding was that gradient_accumulation_steps is generally decoupled from memory usage, but that's not what I'm seeing. That is, a high gradient_accumulation_steps should just make each optimizer step take longer, not use much more memory.
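For reference, the mental model I have is the plain single-device accumulation loop, sketched here torch-free with a toy 1-D linear model (all names and numbers are made up for illustration):

```python
# Toy sketch: gradient accumulation for y = w * x with squared-error loss.
# dL/dw = 2 * (w*x - t) * x. Between micro batches we keep only the running
# gradient sum, so peak memory should not depend on grad_accum_steps.

def train_step(w, batches, grad_accum_steps, lr=0.01):
    grad = 0.0  # single accumulator, same size for any grad_accum_steps
    for x, t in batches[:grad_accum_steps]:
        pred = w * x  # this micro batch's "activations"
        grad += 2 * (pred - t) * x / grad_accum_steps
        # pred is dropped here: per-micro-batch state is freed each iteration
    return w - lr * grad

batches = [(1.0, 2.0), (2.0, 4.0), (0.5, 1.0), (1.5, 3.0)]  # t = 2 * x
w = 0.0
for _ in range(200):
    w = train_step(w, batches, grad_accum_steps=4)
# w approaches the true slope 2.0
```

That's why I expected memory to stay flat as gradient_accumulation_steps grows.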
I am wondering if this phenomenon depends on the model and pipeline parallel sizes too? I am generally using 8 and 8, but I've tried other settings as well. My best guess is that with parallelization, there needs to be more communication to send these accumulated gradients around? I tested on a single node, and memory generally stays flat as you increase gradient_accumulation_steps.
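One thing I'm speculating about (I don't know which schedule my config actually uses): with pipeline parallelism, gradient_accumulation_steps typically sets the number of micro batches in flight through the pipeline, and under a GPipe-style schedule each stage holds activations for every in-flight micro batch until its backward pass. A rough back-of-envelope model, with hypothetical numbers:

```python
def stage_activation_memory(per_microbatch_gb, grad_accum_steps,
                            schedule="gpipe", pipeline_stages=8):
    """Rough activation-memory model for one pipeline stage.
    Hypothetical numbers -- for intuition only, not a real profiler."""
    if schedule == "gpipe":
        # All micro batches are forwarded before any backward, so every
        # in-flight micro batch's activations are held at once.
        in_flight = grad_accum_steps
    else:  # "1f1b": backward starts early, bounding in-flight micro batches
        in_flight = min(grad_accum_steps, pipeline_stages)
    return per_microbatch_gb * in_flight

for g in (1, 2, 4, 8, 16):
    print(g, stage_activation_memory(2.0, g, "gpipe"),
             stage_activation_memory(2.0, g, "1f1b"))
```

If that's right, it would explain memory scaling with gradient_accumulation_steps under pipeline parallelism but staying flat on a single node without it.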
Is anyone else experiencing this, or know if this is accurate?
Thanks!
Are you holding the microbatch size fixed? Or are you decreasing it as you increase gradient accumulation?
Hi @StellaAthena!
I'm only increasing the total batch size by the gradient accumulation factor. The micro batch size is always 1 in my case.
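For concreteness, here's how I'm computing the batch sizes (assuming the usual global batch = micro batch × grad accum × data-parallel size relationship; the parallelism numbers are my config from above):

```python
gpus = 64                 # 64 GPUs across 16 nodes
model_parallel = 8        # model parallel size
pipeline_parallel = 8     # pipeline parallel size
micro_batch = 1           # per-GPU micro batch size

# Data-parallel replicas are whatever GPUs remain after model x pipeline.
data_parallel = gpus // (model_parallel * pipeline_parallel)  # 64 / 64 = 1

for grad_accum in (1, 2, 4, 8):
    global_batch = micro_batch * grad_accum * data_parallel
    print(f"grad_accum={grad_accum} -> global batch {global_batch}")
```

So in my setup the total batch size tracks gradient_accumulation_steps exactly.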
Would be great to hear if anyone else has experienced this too, or if I'm a crazy person. Thanks!