nannau/DoWnGAN

Improve memory management


Right now, DoWnGAN loads all of the test/train data onto the GPU directly. This is not ideal, but it does make training faster, since it avoids the per-batch transfer cost between the CPU and GPU. Unfortunately, this makes the code hard to scale: larger datasets/regions will simply not fit in GPU memory.

The flow is like this: Storage --> CPU --> GPU --> train

"GPUDirect" could help with reducing this overhead: https://developer.nvidia.com/gpudirect

This could potentially change the flow to: Storage --> GPU --> train, which would reduce CPU overhead.

Furthermore, the current framework does not load NetCDFs in batches; it reads the entire NetCDF into memory at once. This will create issues for larger datasets. There is also currently no implementation for loading data from a NetCDF directly onto the GPU, nor for streaming chunks of the NetCDF into memory.
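For reference, chunked reading could look something like the sketch below. A plain NumPy array stands in for the NetCDF variable here; in practice a `netCDF4` `Variable` supports the same slice syntax and reads only the requested slab from disk, so peak memory is bounded by the chunk size rather than the dataset size. All names (`iter_chunks`, the chunk size) are illustrative, not part of the current codebase.

```python
import numpy as np

def iter_chunks(variable, chunk_size):
    """Yield successive slabs along the leading (time/sample) axis.

    `variable` can be anything sliceable. A netCDF4 Variable reads
    only the requested slab from disk, so host memory stays bounded
    by `chunk_size` instead of the full dataset.
    """
    n = variable.shape[0]
    for start in range(0, n, chunk_size):
        yield np.asarray(variable[start:start + chunk_size])

# Stand-in for a netCDF4 variable: 10 samples of a 4x4 field.
data = np.arange(10 * 4 * 4, dtype=np.float32).reshape(10, 4, 4)

chunks = list(iter_chunks(data, chunk_size=4))
print([c.shape[0] for c in chunks])  # chunk lengths: [4, 4, 2]
```

The same pattern drops into a PyTorch `Dataset`/`DataLoader` by mapping indices to slabs instead of materializing the whole array.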

The purpose of this issue is to:

  1. Implement data loaders that read the NetCDF4 file from disk in chunks and load them onto the CPU for training in the current pipeline
  2. Convert (1.) so that it can load chunks directly onto the GPU (if possible)
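Item (2) might be sketched as a generator that reads one slab at a time and hands it to a device-transfer step. With PyTorch the transfer would be something like `torch.from_numpy(chunk).pin_memory().to("cuda", non_blocking=True)`; the `to_device` placeholder below keeps the sketch framework-agnostic and runnable on CPU. All names here are hypothetical, not existing DoWnGAN code.

```python
import numpy as np

def to_device(chunk):
    # Placeholder for the real transfer step, e.g. with PyTorch:
    #   torch.from_numpy(chunk).pin_memory().to("cuda", non_blocking=True)
    # Here we return the array unchanged so the sketch runs anywhere.
    return chunk

def chunked_training_stream(variable, chunk_size):
    """Read one slab at a time and move it to the device.

    Peak host memory is a single chunk rather than the whole NetCDF;
    the device copy of a chunk can be freed as soon as the training
    step has consumed it.
    """
    for start in range(0, variable.shape[0], chunk_size):
        chunk = np.asarray(variable[start:start + chunk_size])
        yield to_device(chunk)

data = np.random.rand(6, 2, 2).astype(np.float32)
batches = list(chunked_training_stream(data, chunk_size=2))
print(len(batches))  # 3
```

Whether the copy can bypass the CPU entirely (the GPUDirect flow above) would depend on storage/driver support; the generator structure stays the same either way.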