abacaj/fine-tune-mistral

Getting OOM error on 4x Tesla T4 (Azure VM NC64as_T4_v3)

saptarshidatta96 opened this issue · 3 comments

Hi,

I am running on 4x Tesla T4, so the total VRAM is around 4 * 16 = 64 GB. The Azure VM being used is NC64as_T4_v3.

The command I am running is:
torchrun --nnodes=1 --nproc-per-node=4 train.py

I am getting the error below on all 4 GPUs. A sample error for GPU 3:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.49GiB. GPU3 has a total capacity of 14.58 GiB of which 233.75MiB is free.

I was under the impression that the model would be distributed across the 4 GPUs, with a cumulative VRAM size of 64 GB, and that I would not need to use QLoRA for fine-tuning.

Can you please tell me if I am missing something?
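For reference, here is a quick sketch (not part of the repo's train.py) that prints how much memory each rank actually sees at startup; the LOCAL_RANK handling matches what torchrun sets:

```python
import os
import torch

# torchrun sets LOCAL_RANK for every process it launches.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# mem_get_info returns (free, total) in bytes for the current device.
free_b, total_b = torch.cuda.mem_get_info()
print(f"rank {local_rank}: {free_b / 1024**3:.2f} GiB free / {total_b / 1024**3:.2f} GiB total")
```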

Same here. I used a quad-RTX 4090 setup (~96GB VRAM) for testing, but it still ran into OOM.

I was able to run the code successfully on a machine with 4x RTX 3090 (96 GB of VRAM in total) by setting both "train_batch_size" and "validation_batch_size" to 1 in "train.py". (As suggested, you may also lower the learning rate.)
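For reference, the edit amounts to something like the following in train.py (a sketch: only the two variable names come from the script as mentioned above; the surrounding code and the example learning-rate value are assumptions):

```python
# train.py -- drop the per-GPU batch sizes to 1 (sketch; exact surrounding code may differ)
train_batch_size = 1        # per-GPU training batch size
validation_batch_size = 1   # per-GPU validation batch size

# Optionally lower the learning rate to go with the smaller batch, e.g.:
# lr = 1e-5  # hypothetical value, tune as needed
```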

[Execution results for 3 epochs: image]

Based on the nvidia-smi output, it seems the execution requires close to the full 96GB of VRAM, as all available memory was nearly used up during the process.

[nvidia-smi output: image]
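If you prefer to track this from inside the run rather than watching nvidia-smi, a small helper like the one below can be called from the training loop (a sketch; where it hooks into train.py is an assumption, but the torch.cuda calls are standard):

```python
import torch
import torch.distributed as dist

def log_peak_memory(step: int) -> None:
    """Print the peak allocated GPU memory for this rank, then reset the counter."""
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"step {step} rank {rank}: peak allocated {peak_gib:.2f} GiB")
    torch.cuda.reset_peak_memory_stats()
```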


Thank you for your test! I was able to run the code on my quad-4090 setup now (with both batch sizes set to 1). That said, on quad 4090s the performance may not be satisfactory due to the limited card-to-card bandwidth without NVLink.