Lightning-Universe/Pose-app

CUDA out of memory error

MichaelGMoore opened this issue · 0 comments

I am running the full app on Ubuntu 22.10 with an RTX 3060 12 GB GPU and CUDA 12.0.

I launch the app with the command:

lightning run app app.py --env NVIDIA_DRIVER_CAPABILITIES=compute,utility,video

I was able to train the "supervised" model, but when attempting to train the "semi-super" model, the app crashes. The terminal shows this just before the crash:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.76 GiB total capacity; 9.16 GiB already allocated; 38.31 MiB free; 9.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The error message suggests setting max_split_size_mb, but I don't know how to set that parameter in practice. Any suggestions?
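
For reference, a minimal sketch of how this setting is usually applied. PyTorch's caching allocator reads max_split_size_mb from the PYTORCH_CUDA_ALLOC_CONF environment variable, which must be set before the first CUDA allocation. The value 128 and the helper name reserved_minus_allocated_mib are illustrative assumptions, not taken from the error message or from the app. Where the variable should be set inside Pose-app is not known here. Passing it through the same --env flag used in the launch command may also work, but that is untested. The helper also computes the gap between reserved and allocated memory from the figures in the traceback, since the hint only applies when that gap is large.

```python
import os

# Set the allocator option before any CUDA memory is allocated; the safest
# place is before `import torch`. The value 128 is an illustrative guess,
# not a recommendation from the error message. setdefault() leaves any
# value already exported in the shell untouched.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")


def reserved_minus_allocated_mib(reserved_gib: float, allocated_gib: float) -> float:
    """Hypothetical helper: memory PyTorch has reserved but is not using, in MiB."""
    return (reserved_gib - allocated_gib) * 1024


# Figures copied from the error message above.
gap_mib = reserved_minus_allocated_mib(reserved_gib=9.24, allocated_gib=9.16)
print(f"reserved but unallocated: {gap_mib:.0f} MiB")
```

With the numbers in this traceback the gap is under 100 MiB, so reserved memory is not much larger than allocated memory. Fragmentation may therefore not be the main cause, and max_split_size_mb may not help much. Lowering the memory footprint of the semi-supervised training run (for example a smaller batch size, if the app exposes one) is probably the more promising direction.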