Cambridge-ICCS/FTorch

CUDA out of memory error for very long runs

Closed this issue · 3 comments

This may be an issue with my implementation rather than with FTorch itself, but in very large tests on GPUs (~100,000 iterations) I sometimes start to run into CUDA memory issues for the cgdrag benchmark example.

This example calls torch_tensor_delete after every iteration, but perhaps this is not cleaning up data on the GPU?

Full error (after running ./benchmarker_cgdrag_torch ../cgdrag_model saved_cgdrag_model_gpu.pt 100000 10 --use_cuda for ~32000 iterations):

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 18.00 MiB is free. Including non-PyTorch memory, this process has 79.12 GiB memory in use. Of the allocated memory 78.63 GiB is allocated by PyTorch, and 5.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This might be due to the way memory is managed in Torch. It is not unusual for libraries to have their own GPU memory management that does not call cudaFree as part of the object's (in this case the Tensor object's) destructor. A similar issue was reported in PyTorch, where the advice is to call torch.cuda.empty_cache(). The C++ equivalent looks like at::cuda::CachingHostAllocator_emptyCache(); (from a unit test example). If this works, we will need our own FTorch library exit function that can force Torch to do the clean-up. Either that, or do some funky stuff with a singleton.
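The caching-allocator behaviour suspected here can be illustrated with a toy model (plain Python, no Torch; the class and method names are illustrative stand-ins, not the real PyTorch API). Freed blocks go back to a pool rather than to the device, so "reserved but unallocated" memory accumulates until an explicit empty-cache call:

```python
class CachingAllocator:
    """Toy model of a caching GPU allocator: freed blocks are kept
    in a pool for reuse and only handed back to the device by an
    explicit empty_cache() call (analogous to cudaFree)."""

    def __init__(self, capacity):
        self.capacity = capacity  # total "device" memory
        self.in_use = 0           # memory held by live tensors
        self.cached = 0           # freed, but still reserved by the library

    def alloc(self, size):
        # Reuse cached blocks first, then claim fresh device memory.
        reuse = min(size, self.cached)
        self.cached -= reuse
        fresh = size - reuse
        if self.in_use + self.cached + fresh > self.capacity:
            raise MemoryError("CUDA out of memory (toy model)")
        self.in_use += size

    def free(self, size):
        # Tensor destructor: the block goes back to the cache, not the device.
        self.in_use -= size
        self.cached += size

    def empty_cache(self):
        # Force cached blocks back to the device (cf. torch.cuda.empty_cache()).
        self.cached = 0

gpu = CachingAllocator(capacity=100)
gpu.alloc(60)
gpu.free(60)       # destructor ran, but the memory stays reserved
print(gpu.cached)  # 60
```

In this model a well-behaved program that frees every tensor never leaks, since cached blocks are reused by later allocations; that is consistent with the resolution below, where the real culprit turned out to be tensors that were never freed at all.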

This was an issue of usage rather than with the FTorch library itself.

Reusing a torch_tensor variable and then calling torch_tensor_delete, e.g.

in_tensors(1) = torch_tensor_from_array(data_1, stride_1, device_1)
...
...
in_tensors(1) = torch_tensor_from_array(data_2, stride_2, device_2)
torch_tensor_delete(in_tensors(1))

results in only the latter tensor being freed; the first tensor's GPU memory is never released. The fix is either to use a unique torch_tensor variable for each tensor, or to call torch_tensor_delete before overwriting the variable with a new tensor.
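The leak pattern above can be mimicked in a short sketch (plain Python stand-ins for the Fortran API; `Tensor` and `delete` here are illustrative, not FTorch names). Overwriting the handle orphans the first allocation, so only the second one is ever freed:

```python
live = set()    # handles of live "device" allocations
_next = [0]     # monotonically increasing handle counter

class Tensor:
    """Stand-in for a torch_tensor handle: creation allocates device
    memory, and delete() must be called explicitly to free it."""
    def __init__(self):
        _next[0] += 1
        self.h = _next[0]
        live.add(self.h)

    def delete(self):
        live.discard(self.h)

# Leaky pattern: handle reused before deleting.
t = Tensor()           # first tensor allocated
t = Tensor()           # handle overwritten -- first tensor now orphaned
t.delete()             # frees only the second tensor
print(len(live))       # 1: the first allocation is still on the "GPU"

# Correct pattern: delete before reusing the handle.
live.clear()
t = Tensor()
t.delete()             # free the old tensor first
t = Tensor()
t.delete()
print(len(live))       # 0: nothing leaked
```

Repeated ~100,000 times in a benchmark loop, the leaky pattern accumulates one orphaned tensor per iteration, which matches the slow exhaustion seen in the reported run.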

See this change (and full example) for reference.

Perhaps this could be documented via #13, if we do adapt cgdrag?

Agreed, I will put a note on #13 and close this now.
Well found!