[BUG?] Higher "gradient_accumulation_steps" still increases memory usage a lot
exnx opened this issue · 3 comments
Hello, I am seeing a large increase in GPU memory usage as I raise gradient_accumulation_steps. For example, I can fit a desired sequence length with gradient_accumulation_steps at 1 and 4, but at 8 I get an out-of-memory error.
I am using 64 GPUs (16 nodes). Here's the memory usage as I increase gradient_accumulation_steps:
grad accum -> memory
1 -> 54 GB
2 -> 60.6 GB
4 -> 70.4 GB
8 -> OOM
My understanding was that gradient_accumulation_steps is generally decoupled from memory usage, but that's not what I'm seeing. That is, a high gradient_accumulation_steps should just make each optimizer step take longer, not use much more memory.
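For reference, the mental model I have is the plain single-device accumulation loop, sketched here torch-free with a toy 1-D linear model (all names and numbers are made up for illustration):

```python
# Toy sketch: gradient accumulation for y = w * x with squared-error loss.
# dL/dw = 2 * (w*x - t) * x. Between micro batches we keep only the running
# gradient sum, so peak memory should not depend on grad_accum_steps.

def train_step(w, batches, grad_accum_steps, lr=0.01):
    grad = 0.0  # single accumulator, same size for any grad_accum_steps
    for x, t in batches[:grad_accum_steps]:
        pred = w * x  # this micro batch's "activations"
        grad += 2 * (pred - t) * x / grad_accum_steps
        # pred is dropped here: per-micro-batch state is freed each iteration
    return w - lr * grad

batches = [(1.0, 2.0), (2.0, 4.0), (0.5, 1.0), (1.5, 3.0)]  # t = 2 * x
w = 0.0
for _ in range(200):
    w = train_step(w, batches, grad_accum_steps=4)
# w approaches the true slope 2.0
```

That's why I expected memory to stay flat as gradient_accumulation_steps grows.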
I am wondering if this phenomenon depends on the model and pipeline parallel sizes too? I am generally using 8 and 8, but I've tried other settings as well. My best guess is that with parallelization, there needs to be more communication to send these accumulated gradients around? I tested on a single node, and memory generally stays flat as you increase gradient_accumulation_steps.
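One thing I'm speculating about (I don't know which schedule my config actually uses): with pipeline parallelism, gradient_accumulation_steps typically sets the number of micro batches in flight through the pipeline, and under a GPipe-style schedule each stage holds activations for every in-flight micro batch until its backward pass. A rough back-of-envelope model, with hypothetical numbers:

```python
def stage_activation_memory(per_microbatch_gb, grad_accum_steps,
                            schedule="gpipe", pipeline_stages=8):
    """Rough activation-memory model for one pipeline stage.
    Hypothetical numbers -- for intuition only, not a real profiler."""
    if schedule == "gpipe":
        # All micro batches are forwarded before any backward, so every
        # in-flight micro batch's activations are held at once.
        in_flight = grad_accum_steps
    else:  # "1f1b": backward starts early, bounding in-flight micro batches
        in_flight = min(grad_accum_steps, pipeline_stages)
    return per_microbatch_gb * in_flight

for g in (1, 2, 4, 8, 16):
    print(g, stage_activation_memory(2.0, g, "gpipe"),
             stage_activation_memory(2.0, g, "1f1b"))
```

If that's right, it would explain memory scaling with gradient_accumulation_steps under pipeline parallelism but staying flat on a single node without it.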
Is anyone else experiencing this, or know if this is accurate?
Thanks!
Are you holding the microbatch size fixed? Or are you decreasing it as you increase gradient accumulation?
Hi @StellaAthena!
I'm only increasing the total batch size by the gradient accumulation factor. The micro batch size is always 1 in my case.
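For concreteness, here's how I'm computing the batch sizes (assuming the usual global batch = micro batch × grad accum × data-parallel size relationship; the parallelism numbers are my config from above):

```python
gpus = 64                 # 64 GPUs across 16 nodes
model_parallel = 8        # model parallel size
pipeline_parallel = 8     # pipeline parallel size
micro_batch = 1           # per-GPU micro batch size

# Data-parallel replicas are whatever GPUs remain after model x pipeline.
data_parallel = gpus // (model_parallel * pipeline_parallel)  # 64 / 64 = 1

for grad_accum in (1, 2, 4, 8):
    global_batch = micro_batch * grad_accum * data_parallel
    print(f"grad_accum={grad_accum} -> global batch {global_batch}")
```

So in my setup the total batch size tracks gradient_accumulation_steps exactly.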
Would be great to hear if anyone else has experienced this too, or if I'm a crazy person. Thanks!