CUDA OOM error while saving the model
aasthavar opened this issue · 10 comments
Hi @philschmid! Thanks a lot for your blog on fine-tuning FLAN-T5-XXL with LoRA.
I was trying the same on a custom dataset.
Some more details:
Dataset size = 6k records
instance_type = AWS ml.g5.16xlarge
batch_size = 2
gradient_accumulation_steps = 2
learning_rate = 1e-3
num_train_epochs = 1
Training completes with this output -
{'train_runtime': 1364.2004, 'train_samples_per_second': 0.733, 'train_steps_per_second': 0.183, 'train_loss': 1.278140380859375, 'epoch': 1.0}
But I get a CUDA OOM error at the point of saving the model.
Error -
ErrorMessage “OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 22.19
GiB total capacity; 20.34 GiB already allocated; 32.50 MiB free; 20.96 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF”
Code -
trainer.save_model(os.environ["SM_MODEL_DIR"])
tokenizer.save_pretrained(os.environ["SM_MODEL_DIR"])
Do you have any suggestions on how to solve this error?
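(Side note: the max_split_size_mb hint from the traceback can be set through the PYTORCH_CUDA_ALLOC_CONF environment variable before any CUDA memory is allocated. A minimal sketch; the 128 MiB value is just an illustrative choice, not something from this thread:)

import os

# Must run before the first CUDA allocation (easiest: before importing torch).
# Caps the size of free blocks the caching allocator keeps, which the error
# message suggests can help with fragmentation-related OOMs.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch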
Can you try downsampling your dataset to see if the script works and whether the issue is the size of your dataset?
I just tried with a train dataset of 1,000 samples, as well as batch_size=1 and gradient_accumulation_steps=4, but got the same error.
Can you try with gradient_accumulation_steps=1? gradient_accumulation_steps increases the memory quite a bit. And which peft version are you using?
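(For reference, a minimal sketch of the relevant trainer arguments, assuming Seq2SeqTrainingArguments as in the blog post; output_dir is a placeholder and the other values are the ones mentioned earlier in the thread:)

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-xxl-lora",   # placeholder output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,   # suggested change to reduce memory use
    learning_rate=1e-3,
    num_train_epochs=1,
)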
Sure, I just started the training job. Will post the updates here as soon as it's completed. Currently using peft version 0.3.0.
Got the CUDA OOM error again.
Details -
"OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 22.19
GiB total capacity; 20.52 GiB already allocated; 8.50 MiB free; 20.99 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF
That's really weird. Can you try with peft==0.2.0? I will try to rerun it myself.
Okay, thank you!!
Hi @philschmid. It works, I followed your suggestion!!
The combination of peft==0.2.0 and accelerate==0.17.1 worked. I was previously using the latest versions of peft (0.3.0) and accelerate (0.19.0).
Final requirements.txt -
transformers==4.27.2
datasets==2.9.0
accelerate==0.17.1
evaluate==0.4.0
bitsandbytes==0.37.1
loralib
peft==0.2.0
pynvml
Thanks a lot for your suggestions!!
What I didn't quite understand is how a CUDA error relates to the versions of a bunch of libraries. How could someone backtrack from the error to this solution?
Awesome. Pinned peft to 0.2.0
Hello there,
I used the suggested versions of the libraries, but it did not resolve the issue.
Training Args:
per_device_train_batch_size=1,
gradient_accumulation_steps=4
I am using bitsandbytes with the below configuration:
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
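(For context, a sketch of how this config is passed at model load time, assuming a seq2seq model as in the blog post and a transformers version that supports 4-bit loading; the model id is just a placeholder:)

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-xxl"  # placeholder model id
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # the BitsAndBytesConfig defined above
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)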
This is on an RTX 3090 with 24 GB of VRAM, but it still does not resolve the issue.
The GPU VRAM utilisation gradually increases and then I get a CUDA OOM error.
Any suggestions on how I can resolve this?
Thanks for the help.