ServiceNow/picard

Out of memory with default configs/train.json on A100 GPU

lanwuwei opened this issue · 8 comments

Hi Torsten,

I just got a node with 8 A100 GPUs. When I run your code with the default settings (make train) on one A100 (40 GB memory), I get an out-of-memory error. Did you set other hyper-parameters (e.g. fp16, gradient checkpointing, DeepSpeed) to make it work? When I reduce the batch size to 1, it works. However, if I scale up to 8 A100 GPUs (DDP), even with a batch size of 1, I still get an out-of-memory error.

Best,
Wuwei

Oh, yes, you are right, the current configuration will lead to OOM errors. I think I forgot to enable gradient checkpointing in the config. Alternatively, use a per-device batch size of 1.

There is an overhead when you're using DDP which prevents it from working with more than one GPU and a batch size of 1. Gradient checkpointing is the way to go. DeepSpeed may work as well, but I never got it to work better than DDP with gradient checkpointing.
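To see why gradient checkpointing makes the difference, here is a back-of-envelope sketch of activation memory. The model dimensions below are illustrative assumptions (roughly in the T5-3B range), not measurements from picard, and the formula is deliberately simplified to one activation tensor per layer:

```python
# Back-of-envelope activation memory for a transformer, illustrating why
# gradient checkpointing helps. All dimensions are illustrative assumptions,
# not measurements from picard or T5 exactly.

def activation_memory_gb(layers, batch, seq_len, hidden, bytes_per=4,
                         checkpointing=False):
    """Rough activation footprint in GB (simplified).

    Without checkpointing, activations for all layers stay resident for
    the backward pass; with checkpointing, only the checkpointed layer
    inputs are kept and the rest is recomputed, so the resident
    footprint shrinks to roughly one layer's worth.
    """
    per_layer = batch * seq_len * hidden * bytes_per  # one fp32 tensor per layer
    resident_layers = 1 if checkpointing else layers
    return per_layer * resident_layers / 1024**3

full = activation_memory_gb(layers=48, batch=8, seq_len=512, hidden=4096)
ckpt = activation_memory_gb(layers=48, batch=8, seq_len=512, hidden=4096,
                            checkpointing=True)
print(f"without checkpointing: ~{full:.1f} GB, with: ~{ckpt:.2f} GB")
```

This ignores attention matrices, optimizer states, and parameters (which dominate for a 3B model), but it shows the trade: checkpointing cuts activation memory by roughly a factor of the layer count, at the cost of one extra forward pass during backward.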

Got it, thanks Torsten!

@lanwuwei Hi, did you succeed by setting "gradient_checkpointing": true?
I set it but still get OOM errors.
Do you use the following command:
python -m torch.distributed.launch --nnodes=1 --nproc_per_node=8 run_seq2seq.py configs/train.json?
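For reference, `run_seq2seq.py` parses `configs/train.json` into Hugging Face training arguments, so `gradient_checkpointing` goes in as a plain top-level key next to the batch-size settings. A minimal sketch (the key names are standard Hugging Face `TrainingArguments` fields; the values here are illustrative, not the repo's defaults):

```json
{
  "gradient_checkpointing": true,
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 8
}
```

If a per-device batch of 1 still OOMs under DDP, raising `gradient_accumulation_steps` keeps the effective batch size without adding memory.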

Me too. Does "gradient_checkpointing": true work? @tscholak

@lanwuwei I am facing a similar out-of-memory issue with 4 graphics cards. Could you tell me how you solved your problem? Thank you!

If you mean changing the setting to "gradient_checkpointing": true in the JSON file, then yes. Or maybe you have another way to deal with gradient checkpointing?

Have you tried gradient checkpointing?
