jpuigcerver/Laia

Process mini-batch as N blocks to reduce GPU memory usage

mauvilsa opened this issue · 6 comments

In train.lua, provide the possibility to divide each mini-batch into N blocks, processing each block on the GPU sequentially. This is so that training can be performed with large mini-batches (relative to the amount of GPU memory) on GPUs with limited memory, or to leave room for other processes using the GPU.

This could also be done dynamically for each epoch, in order to prevent the execution of train.lua from failing due to lack of memory. At the start of the epoch, the number of blocks N is determined based on the amount of free GPU memory available at that moment.
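A minimal sketch of how the number of blocks N could be picked from the free GPU memory available at that moment (the estimatedBatchBytes argument, i.e. the bytes needed to process the full mini-batch in one block, and the 0.8 safety margin are assumptions for illustration; cutorch.getMemoryUsage is the actual cutorch call):

require 'cutorch'

local function chooseNumBlocks(estimatedBatchBytes)
  -- Free bytes on the current device at this moment.
  local freeBytes = cutorch.getMemoryUsage(cutorch.getDevice())
  -- Leave some headroom for other processes using the GPU.
  return math.max(1, math.ceil(estimatedBatchBytes / (0.8 * freeBytes)))
end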

That would be great. However, keep in mind that mini-batches do not only affect the speed-memory trade-off; they also have an effect on what the network learns, due to the white padding.
For now, I think that the safest option would be to show a WARNING when any of the tools starts.

In the long term, once we have dealt properly with padding, so that it does not (or barely) affect the output of the network, we can implement the automatic batch splitting.

My main concern is that, if we split batches automatically depending on the available memory, we can get different results for apparently the same configuration...

One way or another, we need to implement a method for each of the layers that we use that estimates the memory cost, given the input size (not only the batch size, but also the batch dimensions). It should be fairly simple to extend the methods available for the standard nn/cudnn layers, since "classes" are just tables in Lua. utilities.lua has some examples for the "table" class; for an nn Module it would be something like:

require 'cudnn'

function cudnn.SpatialConvolution:estimateMemoryRequirements(input_size)
  -- Compute here the memory requirements of this module, based on the
  -- number of parameters and the input size.
  return ....
end
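For illustration, the body of such a method could look roughly like the following (an assumed sketch, not actual Laia code: it counts the parameters and their gradients, the output activations, and the gradInput buffer, at 4 bytes per float, and assumes input_size = {batch, channels, height, width}):

require 'cudnn'

function cudnn.SpatialConvolution:estimateMemoryRequirements(input_size)
  -- Sketch only: count the dominant buffers of this convolution.
  local n, h, w = input_size[1], input_size[3], input_size[4]
  local oh = math.floor((h + 2 * self.padH - self.kH) / self.dH) + 1
  local ow = math.floor((w + 2 * self.padW - self.kW) / self.dW) + 1
  -- Weights and biases.
  local params = self.nOutputPlane * (self.nInputPlane * self.kH * self.kW + 1)
  local output = n * self.nOutputPlane * oh * ow    -- forward activations
  local grad_input = n * self.nInputPlane * h * w   -- backward buffer
  -- 4 bytes per float; parameters are stored twice (params + gradParams).
  return 4 * (2 * params + output + grad_input)
end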

And for the "container" modules (like nn.Sequential), something like this (roughly):

require 'nn'

function nn.Sequential:estimateMemoryRequirements(input_size)
  local total_mem = 0
  for _, module in ipairs(self.modules) do
    -- Watch out, because here we should update input_size if it changes from
    -- module to module, which is actually the case for pooling layers, for instance.
    total_mem = total_mem + module:estimateMemoryRequirements(input_size)
  end
  return total_mem
end
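Hypothetical usage, assuming estimateMemoryRequirements has been defined for every module type that appears in the network:

local net = nn.Sequential()
net:add(cudnn.SpatialConvolution(1, 16, 3, 3, 1, 1, 1, 1))
-- input_size given as {batch, channels, height, width}
print(net:estimateMemoryRequirements({16, 1, 64, 256}))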

In the comment before, I meant at the start of each mini-batch, not each epoch.

A clarification regarding the blocks: the mini-batch size would always be the same (i.e. there is still only one gradient update per mini-batch); the only difference is that, on the GPU, each mini-batch is split into N blocks. Since the image widths are known at the start of the mini-batch, this can be implemented so that it gives exactly the same result as processing the whole mini-batch in one block.

You are right. Go for it!

@mauvilsa I'm not sure if you are aware, but I implemented this feature a few months ago. The CTCTrainer class accepts an argument (batchChunkSize, or --batch_chunk_size from the command line) to split the batch in each iteration into a variable number of chunks of batch_chunk_size MB. This is useful to limit the maximum number of MB processed in each batch.

The gradients are accumulated across all chunks to perform a single parameter update.
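As an illustrative sketch of the chunking idea (a toy model with made-up names, not the actual CTCTrainer code): the forward/backward passes run chunk by chunk, backward() accumulates into the shared gradient buffer, and a single parameter update is applied at the end.

require 'nn'

local model = nn.Linear(10, 2)
local criterion = nn.MSECriterion()
criterion.sizeAverage = false  -- sum losses so chunk gradients add up exactly
local params, gradParams = model:getParameters()

local batchInput, batchTarget = torch.randn(8, 10), torch.randn(8, 2)
local inputs  = batchInput:split(2, 1)   -- chunks of 2 samples along dim 1
local targets = batchTarget:split(2, 1)

gradParams:zero()
for c = 1, #inputs do
  local output = model:forward(inputs[c])
  criterion:forward(output, targets[c])
  -- backward() accumulates gradients, so they sum across chunks.
  model:backward(inputs[c], criterion:backward(output, targets[c]))
end
gradParams:div(batchInput:size(1))  -- average the summed loss over the batch
params:add(-0.01, gradParams)       -- single update for the whole mini-batch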

Nice! I will try to test this soon.