songlab-cal/gpn

How can I specify the GPU to run the program?

dahaigui opened this issue · 2 comments

After running for a while, the program crashed with the error below. How should I address this?

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacty of 21.99 GiB of which 365.88 MiB is free. Including non-PyTorch memory, this process has 21.62 GiB memory in use. Of the allocated memory 21.09 GiB is allocated by PyTorch, and 245.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
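Regarding the title question, the standard way to pin a PyTorch process to a particular GPU is the `CUDA_VISIBLE_DEVICES` environment variable, set before any CUDA initialization. A minimal sketch (the device index and the `max_split_size_mb` value are illustrative, not recommendations from this thread):

```python
import os

# Make only GPU 1 visible to this process; PyTorch will see it as cuda:0.
# Device indices are zero-based. Must be set before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# Optionally limit allocator block splitting to reduce fragmentation,
# as the error message itself suggests (512 is an example value).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
```

Equivalently, on the command line: `CUDA_VISIBLE_DEVICES=1 python train.py ...`.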

Hello! You should adjust the batch size based on the number and type of GPUs you have. We recommend a total batch size of 2048, but a smaller one may work as well.

In our example we have:

    --per_device_train_batch_size 512 --per_device_eval_batch_size 512 --gradient_accumulation_steps 1

We trained on 4 large GPUs, so the total batch size was 4*512*1 = 2048.

If you instead have 4 smaller GPUs with half the memory, halve the per-device batch size and double `gradient_accumulation_steps`:

    --per_device_train_batch_size 256 --per_device_eval_batch_size 256 --gradient_accumulation_steps 2

so the total batch size stays 4*256*2 = 2048.
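The arithmetic above can be sketched as a small helper (the function name is illustrative):

```python
def effective_batch_size(num_gpus, per_device_batch_size, gradient_accumulation_steps):
    """Total batch size seen by the optimizer per update step."""
    return num_gpus * per_device_batch_size * gradient_accumulation_steps

# 4 large GPUs, as in the example command:
print(effective_batch_size(4, 512, 1))  # → 2048

# 4 smaller GPUs with half the memory, doubling accumulation:
print(effective_batch_size(4, 256, 2))  # → 2048
```

Gradient accumulation trades compute time for memory: each optimizer step sums gradients over several smaller forward/backward passes, so the per-device memory footprint shrinks while the effective batch size is preserved.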

In general, you can find more information in the documentation for the Hugging Face `Trainer` and `TrainingArguments` classes:
https://huggingface.co/docs/transformers/main_classes/trainer

Thank you very much, it's running now!