EleutherAI/gpt-neox

[BUG?] Higher "gradient_accumulation_steps" still increases memory usage a lot

exnx opened this issue · 3 comments

exnx commented

Hello, I am seeing a large increase in GPU memory usage as I raise gradient_accumulation_steps. For example, I can fit my desired sequence length with gradient_accumulation_steps at 1 and 4, but at 8 I get an out-of-memory error.

I am using 64 gpus (16 nodes). Here's the memory usage as I increase the gradient_accumulation_steps:

grad accum -> memory
1 -> 54 GB
2 -> 60.6 GB
4 -> 70.4 GB
8 -> OOM

My understanding was that gradient_accumulation_steps is generally decoupled from memory usage, but that's not what I'm seeing. That is, I should be able to use a high gradient_accumulation_steps and it just takes longer, without using much more memory.
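For reference, here's the simple (non-pipeline) accumulation pattern I had in mind. This is just a minimal PyTorch sketch of my mental model, not the gpt-neox/DeepSpeed code, and the model and batch sizes are made up:

```python
import torch

# Toy model and optimizer; sizes are arbitrary, just for illustration.
model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters())
grad_accum_steps = 8  # hypothetical value

data = torch.randn(64, 32, 1024).cuda()  # 64 micro-batches of shape (32, 1024)

opt.zero_grad()
for step, micro_batch in enumerate(data):
    # Scale the loss so accumulated gradients average over the window.
    loss = model(micro_batch).pow(2).mean() / grad_accum_steps
    loss.backward()  # activations for this micro-batch are freed here
    if (step + 1) % grad_accum_steps == 0:
        opt.step()       # one optimizer step per accumulation window
        opt.zero_grad()
```

In this plain data-parallel setting, peak memory is set by a single micro-batch, so raising grad_accum_steps should only cost time, not memory. That's the behavior I expected.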

I am wondering whether this also depends on the model-parallel and pipeline-parallel sizes. I am generally using 8 and 8, but I've tried other settings as well. My best guess is that with parallelism there is extra communication to move the accumulated gradients around? On a single node, memory stays roughly flat as I increase gradient_accumulation_steps.
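To make that guess a bit more concrete, here's the toy back-of-the-envelope I've been using. The schedule assumptions and the per-micro-batch memory number are mine, purely for illustration, and not how gpt-neox/DeepSpeed actually accounts memory:

```python
# Toy estimate of per-stage activation memory under pipeline parallelism.
# Assumption (mine): a GPipe-style schedule keeps activations for every
# in-flight micro-batch resident on a stage until its backward pass, while a
# 1F1B-style schedule caps in-flight micro-batches at roughly the stage count.

def in_flight_microbatches(grad_accum_steps: int, pipe_stages: int, schedule: str) -> int:
    if schedule == "gpipe":
        return grad_accum_steps                     # all micro-batches buffered
    if schedule == "1f1b":
        return min(grad_accum_steps, pipe_stages)   # bounded by pipeline depth
    raise ValueError(schedule)

act_mem_per_microbatch_gb = 2.0  # made-up number, just for scale
for ga in (1, 2, 4, 8):
    gpipe = in_flight_microbatches(ga, 8, "gpipe") * act_mem_per_microbatch_gb
    onefonb = in_flight_microbatches(ga, 8, "1f1b") * act_mem_per_microbatch_gb
    print(f"grad_accum={ga}: ~{gpipe} GB (gpipe) vs ~{onefonb} GB (1f1b) of activations per stage")
```

If the pipeline engine buffers activations (or communication buffers) per in-flight micro-batch, something like the first column would line up with the roughly linear growth I'm seeing, whereas without pipeline parallelism the accumulation steps never become extra in-flight micro-batches.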

Is anyone else experiencing this, or know if this is accurate?

Thanks!

StellaAthena commented

Are you holding the micro batch size fixed, or are you decreasing it as you increase gradient accumulation?

exnx commented

Hi @StellaAthena!

I'm increasing the total batch size by the gradient accumulation factor only. The micro batch size is always just 1 in my case.
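For concreteness, here is the batch arithmetic for my setup, assuming the usual global batch = micro batch × accumulation steps × data-parallel size relation (the numbers are mine from the run above):

```python
# Effective batch size in my runs: micro batch fixed at 1, only grad accum varies.
world_size = 64       # 16 nodes
model_parallel = 8
pipe_parallel = 8
micro_batch_per_gpu = 1

data_parallel = world_size // (model_parallel * pipe_parallel)  # = 1
for grad_accum in (1, 2, 4, 8):
    global_batch = micro_batch_per_gpu * grad_accum * data_parallel
    print(f"grad_accum={grad_accum} -> global batch size {global_batch}")
```

So with 8 × 8 parallelism across 64 GPUs, the data-parallel size is 1 and the global batch size is exactly the gradient accumulation factor.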

exnx commented

Would be great to hear if anyone else has experienced this too, or if I'm just crazy. Thanks!