Getting OOM error on 4x Tesla T4 (Azure VM NC64as_T4_v3)
saptarshidatta96 opened this issue · 3 comments
Hi,
I am running on 4x Tesla T4, so the total VRAM is around 4 * 16 = 64 GB. The Azure VM being used is NC64as_T4_v3.
The command I am running is:
torchrun --nnodes=1 --nproc-per-node=4 train.py
I am getting the error below on all 4 GPUs. A sample error for GPU 3:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.49GiB. GPU3 has a total capacity of 14.58 GiB of which 233.75MiB is free.
I was under the impression that the model would be distributed across the 4 GPUs with a cumulative VRAM size of 64 GB, and that I would not need to use QLoRA for fine-tuning.
Can you please tell me if I am missing something?
Same here. I used a quad-RTX 4090 setup (~96GB VRAM) for testing, but it still ran into OOM.
I was able to run the code successfully on a machine with 4x RTX 3090 (96 GB of VRAM in total) by setting both "train_batch_size" and "validation_batch_size" to 1 in "train.py" (see the sketch below). As suggested, you may also lower the learning rate.
[Execution results for 3 epochs]
Based on the nvidia-smi output, it seems the execution requires close to the full 96GB of VRAM, as all available memory was nearly used up during the process.
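For anyone looking for the exact spot, here is a minimal sketch of what that change might look like. This is not the repository's actual train.py: the names train_batch_size and validation_batch_size come from the comment above, but the surrounding DataLoader setup and placeholder datasets are assumptions made purely for illustration.

# Minimal, self-contained sketch (assumed structure, not the real train.py):
# shows where the two batch-size settings would typically be consumed.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_batch_size = 1        # lowered to 1 so each GPU's share fits in VRAM
validation_batch_size = 1   # keep validation batches at 1 as well

# Placeholder datasets purely for illustration; the real script loads its own data.
train_dataset = TensorDataset(torch.randn(8, 16), torch.randint(0, 2, (8,)))
val_dataset = TensorDataset(torch.randn(4, 16), torch.randint(0, 2, (4,)))

train_loader = DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=validation_batch_size, shuffle=False)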
Thank you for your test! I was able to run the code on my quad-4090 setup now (with both batch sizes set to 1). Though on quad 4090s, the performance may not be satisfactory due to the limited card-to-card bandwidth without NVLink.