CUDA OOM error while saving the model
aasthavar opened this issue · 10 comments
Hi @philschmid! Thanks a lot for your blog on fine-tuning FLAN-T5-XXL with LoRA.
I was trying the same on a custom dataset.
Some more details:
Dataset size = 6k records
instance_type = AWS ml.g5.16xlarge
batch_size = 2
gradient_accumulation_steps = 2
learning_rate = 1e-3
num_train_epochs = 1
Training completes with this output -
{'train_runtime': 1364.2004, 'train_samples_per_second': 0.733, 'train_steps_per_second': 0.183, 'train_loss': 1.278140380859375, 'epoch': 1.0}
But I get a CUDA OOM error at the point of saving the model.
Error -
ErrorMessage “OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 22.19
GiB total capacity; 20.34 GiB already allocated; 32.50 MiB free; 20.96 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF”
Code -
trainer.save_model(os.environ["SM_MODEL_DIR"])
tokenizer.save_pretrained(os.environ["SM_MODEL_DIR"])
Do you have any suggestions on how to solve this error?
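(Side note: the max_split_size_mb hint from the traceback can be set through the PYTORCH_CUDA_ALLOC_CONF environment variable before any CUDA memory is allocated. A minimal sketch; the 128 MiB value is just an illustrative choice, not something from this thread:)

import os

# Must run before the first CUDA allocation (easiest: before importing torch).
# Caps the size of free blocks the caching allocator keeps, which the error
# message suggests can help with fragmentation-related OOMs.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch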
Can you try downsampling your dataset to see if the script works and whether the issue is the size of your dataset?
I just tried with a train dataset of 1,000 samples, as well as batch_size=1 and gradient_accumulation_steps=4, but got the same error.
Can you try with gradient_accumulation_steps=1? gradient_accumulation_steps increases the memory quite a bit. And which peft version are you using?
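(For reference, a minimal sketch of the relevant trainer arguments, assuming Seq2SeqTrainingArguments as in the blog post; output_dir is a placeholder and the other values are the ones mentioned earlier in the thread:)

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-xxl-lora",   # placeholder output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,   # suggested change to reduce memory use
    learning_rate=1e-3,
    num_train_epochs=1,
)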
Sure, I just started the training job. Will post the updates here as soon as it's completed. Currently using peft version 0.3.0.
Got the CUDA OOM error again.
Details -
"OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 22.19
GiB total capacity; 20.52 GiB already allocated; 8.50 MiB free; 20.99 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF
That's really weird. Can you try with peft==0.2.0? I will try to rerun it myself.
Okay, thank you!!
Hi @philschmid. It works, I followed your suggestion!!
The combination of peft==0.2.0 and accelerate==0.17.1 worked. I was previously using the latest versions of peft (0.3.0) and accelerate (0.19.0).
Final requirements.txt -
transformers==4.27.2
datasets==2.9.0
accelerate==0.17.1
evaluate==0.4.0
bitsandbytes==0.37.1
loralib
peft==0.2.0
pynvml
Thanks a lot for your suggestions!!
What I didn't quite understand is how a CUDA error relates to the versions of a bunch of libraries. How could someone backtrack from the error to this solution?
Awesome. Pinned peft to 0.2.0
Hello there,
I used the suggested versions of the libraries, but it did not resolve the issue.
Training Args:
per_device_train_batch_size=1,
gradient_accumulation_steps=4
I am using bitsandbytes with the below configuration:
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
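(For context, a sketch of how this config is passed at model load time, assuming a seq2seq model as in the blog post and a transformers version that supports 4-bit loading; the model id is just a placeholder:)

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-xxl"  # placeholder model id
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # the BitsAndBytesConfig defined above
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)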
This is on an RTX 3090 with 24 GB of VRAM, but it still does not resolve the issue.
The GPU VRAM utilisation gradually increases and then I get a CUDA OOM error.
Any suggestions on how I can resolve this?
Thanks for the help.