CUDA – memory spanning multiple devices and a question about cuda_global_memory
Hi there,
Thanks for making a fantastic bit of code. I've got a question about what to do if your problem won't fit into GPU RAM on one device, but will on two. You refer to using cudaMallocManaged, which I naïvely understand would handle host/GPU page faulting in case the problem is too large, and I understand there is much pain (!) to be had in spanning multiple devices and keeping them synchronised.
a) You have explicitly disabled this and gone for manual memory management by hardcoding a parameter (cuda_global_memory) as false. Is this for performance reasons? I do find that CPU-only approaches are indeed much faster.
b) Do you have any plans to permit multi-GPU usage and to span the devices with some sort of NUMA architecture? My problem is about 110 GB in RAM (don't ask!), and I realise this is a huge amount of work and the answer is probably 'no'.
Thanks for your help,
Hi there,
a) In my experience, there are some rare situations where global memory is slower than pure GPU memory. For example, host-to-device copies do not overlap with compute tasks. Previously, some tools used global memory and some didn't. We unified this and turned it off by default. However, you can activate global memory through an environment variable, namely by setting BART_GPU_GLOBAL_MEMORY=1.
This does not seem to be documented, but we'll add it soon.
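As a rough sketch of how the variable could be set for a single run without exporting it in your shell, something like the following might work. The helper names `bart_env` and `run_bart` are made up for illustration and are not part of BART. The sketch also assumes the executable is called `bart` and is on your PATH, and that leaving the variable unset gives the default (off) behaviour.

```python
import os
import shutil
import subprocess


def bart_env(enable_global_memory=True, base_env=None):
    """Return a copy of the environment with BART_GPU_GLOBAL_MEMORY set or removed.

    Hypothetical helper: only the variable name and the value "1" come from
    the reply above. Removing the variable is assumed to restore the default.
    """
    env = dict(os.environ if base_env is None else base_env)
    if enable_global_memory:
        env["BART_GPU_GLOBAL_MEMORY"] = "1"
    else:
        env.pop("BART_GPU_GLOBAL_MEMORY", None)
    return env


def run_bart(args, enable_global_memory=True):
    """Run the bart executable with the given argument list.

    Hypothetical wrapper: assumes an executable named "bart" is on PATH.
    The arguments are passed through unchanged.
    """
    bart = shutil.which("bart")
    if bart is None:
        raise FileNotFoundError("bart executable not found on PATH")
    return subprocess.run(
        [bart, *args],
        env=bart_env(enable_global_memory),
        check=True,
    )
```

Calling `run_bart([...])` with the tool arguments you normally use would then run that one command with global memory enabled, which makes it easy to time the same reconstruction with and without it.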
b) We have support for multi-GPU based on MPI. So far, it is available in the pics tool via command line options and for training in the deep-learning tools (reconet and nlinvnet). If you tell us more about your problem, it may already be covered.
Best,
Moritz