ServiceNow/picard

Out of memory with default configs/train.json on A100 GPU

lanwuwei opened this issue · 8 comments

Hi Torsten,

I just got a node with 8 A100 GPUs. When I run your code with the default settings (make train) on one A100 (40 GB memory), I get an out-of-memory error. Did you set other hyper-parameters (e.g. fp16, gradient checkpointing, DeepSpeed) to make it work? When I reduce the batch size to 1, it works. However, if I scale up to 8 A100 GPUs (DDP), even with a batch size of 1, I still get an out-of-memory error.

Best,
Wuwei

Oh, yes, you are right, the current configuration will lead to OOM errors. I think I forgot to enable gradient checkpointing in the config. Alternatively, use a per-device batch size of 1.

There is an overhead when you're using DDP which prevents it from working with more than one GPU and a batch size of 1. Gradient checkpointing is the way to go. DeepSpeed may work as well, but I never got it to work better than DDP with gradient checkpointing.
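To see why gradient checkpointing makes the difference, here is a back-of-envelope sketch of activation memory. The model dimensions below are illustrative assumptions (roughly in the T5-3B range), not measurements from picard, and the formula is deliberately simplified to one activation tensor per layer:

```python
# Back-of-envelope activation memory for a transformer, illustrating why
# gradient checkpointing helps. All dimensions are illustrative assumptions,
# not measurements from picard or T5 exactly.

def activation_memory_gb(layers, batch, seq_len, hidden, bytes_per=4,
                         checkpointing=False):
    """Rough activation footprint in GB (simplified).

    Without checkpointing, activations for all layers stay resident for
    the backward pass; with checkpointing, only the checkpointed layer
    inputs are kept and the rest is recomputed, so the resident
    footprint shrinks to roughly one layer's worth.
    """
    per_layer = batch * seq_len * hidden * bytes_per  # one fp32 tensor per layer
    resident_layers = 1 if checkpointing else layers
    return per_layer * resident_layers / 1024**3

full = activation_memory_gb(layers=48, batch=8, seq_len=512, hidden=4096)
ckpt = activation_memory_gb(layers=48, batch=8, seq_len=512, hidden=4096,
                            checkpointing=True)
print(f"without checkpointing: ~{full:.1f} GB, with: ~{ckpt:.2f} GB")
```

This ignores attention matrices, optimizer states, and parameters (which dominate for a 3B model), but it shows the trade: checkpointing cuts activation memory by roughly a factor of the layer count, at the cost of one extra forward pass during backward.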

Got it, thanks Torsten!

@lanwuwei Hi, did you succeed by setting "gradient_checkpointing": true?
I set it but still get OOM errors.
Do you use the following command:
python -m torch.distributed.launch --nnodes=1 --nproc_per_node=8 run_seq2seq.py configs/train.json?
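For reference, `run_seq2seq.py` parses `configs/train.json` into Hugging Face training arguments, so `gradient_checkpointing` goes in as a plain top-level key next to the batch-size settings. A minimal sketch (the key names are standard Hugging Face `TrainingArguments` fields; the values here are illustrative, not the repo's defaults):

```json
{
  "gradient_checkpointing": true,
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 8
}
```

If a per-device batch of 1 still OOMs under DDP, raising `gradient_accumulation_steps` keeps the effective batch size without adding memory.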

Me too. Does "gradient_checkpointing": true work? @tscholak

@lanwuwei I am facing a similar out-of-memory issue with 4 graphics cards. Could you tell me how you solved your problem? Thank you!

If you mean changing the setting to "gradient_checkpointing": true in the JSON file, then yes. Or maybe you have another way to deal with gradient checkpointing?

Have you tried gradient checkpointing?
