
[BUG?] Higher "gradient_accumulation_steps" still increases memory usage a lot

exnx opened this issue · 3 comments

exnx commented

Hello, I am seeing a large increase in GPU memory usage the larger I make gradient_accumulation_steps. For example, I can fit a desired sequence length with gradient_accumulation_steps at 1 and 4, but at 8 I get an out-of-memory error.

I am using 64 GPUs (16 nodes). Here's the memory usage as I increase gradient_accumulation_steps:

grad accum -> memory
1 -> 54 GB
2 -> 60.6 GB
4 -> 70.4 GB
8 -> OOM

My understanding was that gradient_accumulation_steps is generally decoupled from memory usage, but that's not what I'm seeing. That is, I should be able to use a high gradient_accumulation_steps and have it just take longer, without using much more memory.
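To make that expectation concrete, here is a minimal sketch of plain gradient accumulation on a toy one-parameter model. The function names (`grad_mse`, `accumulated_grad`) are hypothetical and exist only for illustration; this is not the training code used in the runs above. It shows that only one micro-batch is processed at a time and that a single gradient buffer is reused, so in this simple setting memory should not grow with the number of accumulation steps.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the model y = w * x over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, accum_steps):
    """Accumulate the gradient over `accum_steps` equal-sized micro-batches.

    Only one micro-batch is touched per step, and the same scalar buffer
    is reused for every step.
    """
    assert len(xs) % accum_steps == 0, "batch must split evenly"
    micro = len(xs) // accum_steps
    grad_buffer = 0.0  # single buffer, reused across all micro-batches
    for step in range(accum_steps):
        lo, hi = step * micro, (step + 1) * micro
        # Scale each micro-batch gradient so the sum equals the full-batch mean.
        grad_buffer += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return grad_buffer


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]
full = grad_mse(0.5, xs, ys)
for steps in (1, 2, 4, 8):
    # Each accumulated gradient matches the full-batch gradient up to rounding.
    print(steps, abs(accumulated_grad(0.5, xs, ys, steps) - full) < 1e-9)
```

The accumulated gradient matches the full-batch gradient for every step count, which is why accumulation is usually described as trading time for memory.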

I am wondering if this phenomenon depends on the model and pipeline parallel values too? I am generally using 8 and 8, but I've tried other settings too. My best guess is that with parallelization, there needs to be more communication to send these accumulated gradients around? I tested on a single node, and the memory generally stays flat as you increase the gradient_accumulation_steps.
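One way the pipeline parallel setting could matter, offered as a hypothesis rather than a confirmed diagnosis: under a 1F1B-style pipeline schedule, each stage may keep activations for several in-flight micro-batches, and that count can grow with the number of micro-batches until it reaches the number of stages. The sketch below assumes the in-flight count on a stage is `min(num_stages - stage_id, accum_steps)` and that gradient_accumulation_steps equals the number of micro-batches per pipeline step. Both are assumptions for illustration, and `in_flight_micro_batches` is a hypothetical helper, not a library function.

```python
def in_flight_micro_batches(num_stages, stage_id, accum_steps):
    """Assumed number of micro-batches whose activations a stage holds at once.

    Assumption: a 1F1B-style schedule where earlier stages buffer more
    micro-batches, capped by the total number of micro-batches.
    """
    return min(num_stages - stage_id, accum_steps)


# With 8 pipeline stages, look at the first stage (stage_id = 0).
for accum in (1, 2, 4, 8, 16):
    print(accum, "->", in_flight_micro_batches(8, 0, accum))
# Under this assumption the count is 1, 2, 4, 8, 8: it grows with the
# accumulation steps until it saturates at the number of stages.
```

If this assumption holds for the setup in question, activation memory on the early stages would rise with gradient_accumulation_steps up to the pipeline depth and then level off, while a run with no pipeline parallelism would stay flat.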

Is anyone else experiencing this, or know if this is accurate?


StellaAthena commented

Are you holding the micro batch size fixed? Or are you decreasing it as you increase gradient accumulation?

exnx commented

Hi @StellaAthena!

I'm only increasing the total batch size by the gradient accumulation factor. The micro batch size is always 1 in my case.

exnx commented

Would be great to hear if anyone else experienced this in general too, or if I'm a crazy person. Thanks!