ThilinaRajapakse/BERT_binary_text_classification

Gradient Accumulation Steps

Magpi007 opened this issue · 4 comments

Hi Thilina,

I am trying to track the process of training, and there are two steps that I can't understand. The first is when computing the loss:

    if GRADIENT_ACCUMULATION_STEPS > 1:
        loss = loss / GRADIENT_ACCUMULATION_STEPS

And the second is when optimizing (I can't work out what the global_step variable is for):

    if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0: global_step += 13

Would it be possible to clarify these operations on the gradient?

Thanks!

The first step is performing gradient accumulation. It's very useful when you are working with large models, as it lets you sidestep GPU memory limitations when trying to increase the batch size. The basic idea is that instead of calling optimizer.step() after every batch, you let the gradient (or rather the loss) accumulate for a given number of steps, dividing the loss at each step by the number of gradient accumulation steps, and call optimizer.step() once the specified number of accumulation steps is reached. This simulates the effect of using a larger batch size without the associated memory cost. For example, if your GPU memory isn't enough to support a batch size of 16, you could instead use a batch size of 2 with 8 gradient accumulation steps.
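
To make that concrete, here is a minimal sketch of how the two pieces fit together in a training loop. The GRADIENT_ACCUMULATION_STEPS value is hypothetical, and it assumes a pytorch-pretrained-bert-style BertForSequenceClassification model that returns the loss when label ids are passed in; dataloader, optimizer and device stand for whatever you already set up, so treat this as an outline rather than the exact notebook code:

    GRADIENT_ACCUMULATION_STEPS = 8  # hypothetical: batch size 2 x 8 steps = effective batch size 16

    def train_one_epoch(model, dataloader, optimizer, device):
        model.train()
        optimizer.zero_grad()
        global_step = 0
        for step, batch in enumerate(dataloader):
            input_ids, input_mask, segment_ids, label_ids = (t.to(device) for t in batch)

            # Forward pass; passing the label ids makes the (pytorch-pretrained-bert style)
            # model return the loss directly.
            loss = model(input_ids, segment_ids, input_mask, label_ids)

            # Scale the loss so the gradients summed over the accumulation window
            # match what a single large batch would have produced.
            if GRADIENT_ACCUMULATION_STEPS > 1:
                loss = loss / GRADIENT_ACCUMULATION_STEPS

            loss.backward()  # gradients keep accumulating in the parameters' .grad

            # Update the weights only once every GRADIENT_ACCUMULATION_STEPS batches.
            if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
                optimizer.step()
                optimizer.zero_grad()
                global_step += 1  # counts optimizer updates, not batches seen
        return global_step

With a per-batch size of 2 and these settings, only 2 examples sit on the GPU at a time, but the weights are updated as if the batch size were 16.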

The second part is checking to see whether the gradient/loss has been accumulated for the specified number of times before calling optimizer.step().
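
As a quick way to see what that check does, you can print which batch indices would actually trigger an update (using a hypothetical value of 4):

    GRADIENT_ACCUMULATION_STEPS = 4  # hypothetical, just for illustration

    update_steps = [step for step in range(12)
                    if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0]
    print(update_steps)  # -> [3, 7, 11]: optimizer.step() runs once every 4 batches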

And for the second part, where else do we use global_step, and why do we choose 13?

I'm not sure where the 13 is coming from. Isn't that from the training loop? This is what I have:

        if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
            global_step += 1

Oh sorry, my mistake when writing it... thanks anyway!