LUMIA-Group/rasat

Out of memory with default configs/train.json on 4×24 GB GPUs

shenyang0111ucf opened this issue · 9 comments

Hi @JiexingQi, I found you asked a similar question here: ServiceNow#29. I tried to train t5-3b with `CUDA_VISIBLE_DEVICES="0,1,2,3" python3 -m torch.distributed.launch --nnodes=1 --nproc_per_node=4 seq2seq/run_seq2seq.py configs/train.json`, even with a config like this:
"per_device_train_batch_size": 1,
"per_device_eval_batch_size": 1,
"gradient_accumulation_steps": 1,
"gradient_checkpointing": true,
But I still get an out-of-memory error, and all four GPUs' memory is used up (about 22 GB on each GPU).
I think you must have had similar experience when using the PICARD code. Could you show me how you solved this annoying out-of-memory problem? Thank you!
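For context, a back-of-envelope estimate shows why this setup runs out of memory: with `torch.distributed.launch` (data parallelism), every GPU holds a full replica of the model, gradients, and Adam optimizer state. The numbers below are approximate and assume standard fp32 fine-tuning; activations are excluded, so the real footprint is even larger.

```python
# Rough memory estimate for full fine-tuning of T5-3B with Adam,
# fp32 throughout (weights, gradients, optimizer state). Activation
# memory is ignored, so this is a lower bound per GPU under DDP.
params = 3e9                       # ~3 billion parameters
bytes_weights = 4 * params         # fp32 parameters
bytes_grads   = 4 * params         # fp32 gradients
bytes_adam    = 8 * params         # Adam momentum + variance (fp32 each)

total_gb = (bytes_weights + bytes_grads + bytes_adam) / 1024**3
print(f"~{total_gb:.0f} GB per GPU before activations")  # ~45 GB
```

Since data-parallel training replicates this full ~45 GB state on each card, four 24 GB GPUs do not help; the state has to be sharded or split (e.g. model parallelism or an optimizer-sharding scheme) for it to fit.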

Hi, @shenyang0111ucf which type of GPU do you use?


@JiexingQi 4 NVIDIA TITAN RTX 24 GB cards.

A 24 GB GPU does not seem to be enough to train the T5-3B model; we use a 40 GB A100 to train it (the same as PICARD). By the way, evaluation can be run on a 24 GB 3090 GPU.


I want to use four 24 GB graphics cards instead of one 40 GB A100 to train the model. Do you have any experience with `torch.distributed.launch` to make that work?

Maybe you could try model parallelism in this situation, but I have not tried it myself.


Ok, I will try to figure out how to fix this problem with model parallelism. Thank you for your time!
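A naive model-parallel setup along these lines could look like the sketch below. It assumes Hugging Face's (now-deprecated) `T5ForConditionalGeneration.parallelize()` API, which places blocks of layers on different GPUs according to a device map; the `make_device_map` helper is hypothetical, and the GPU-dependent lines are commented out since this has not been tested on actual hardware.

```python
# Sketch: splitting T5-3B's 24 transformer blocks across four GPUs with
# Hugging Face's naive model parallelism. make_device_map is a
# hypothetical helper; the parallelize() call itself is a real (deprecated)
# transformers API but is left commented out as it requires GPUs.

def make_device_map(num_layers, num_gpus):
    """Assign an equal contiguous share of layer indices to each GPU."""
    per_gpu = (num_layers + num_gpus - 1) // num_gpus  # ceiling division
    return {
        gpu: list(range(gpu * per_gpu, min((gpu + 1) * per_gpu, num_layers)))
        for gpu in range(num_gpus)
    }

# T5-3B has 24 encoder blocks; spread them over 4 GPUs.
device_map = make_device_map(24, 4)
print(device_map)  # {0: [0..5], 1: [6..11], 2: [12..17], 3: [18..23]}

# from transformers import T5ForConditionalGeneration
# model = T5ForConditionalGeneration.from_pretrained("t5-3b")
# model.parallelize(device_map)   # place blocks on GPUs per the map
# ... train as usual; model.deparallelize() moves everything back ...
```

Note this only splits the weights; each GPU still needs room for its share of gradients, optimizer state, and activations, so batch size may still need to stay at 1.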

You are welcome!

Excuse me, regarding the question raised by @shenyang0111ucf, I would like to ask whether T5-3B can run on 4 NVIDIA GeForce RTX 3090 graphics cards, each with 24 GB. Thank you. @JiexingQi


I think it is not enough for training, but it works for evaluation.