affjljoo3581/GPT2

Is Apex useful for GPT-2?


Hi, is there a reduction in the size of the GPT-2 model when using Apex? And is the inference speed of the model faster?

Hi! Thanks for your issue! We are currently trying to remove the dependency on apex, but it still provides some useful tricks for training.

apex.amp supports a full half-precision mode

When using automatic mixed precision in training, you can choose the amp level passed to amp.initialize. While O1 mode patches the model to perform actual mixed precision, O2 mode casts all parameters of the model to torch.float16 without restoring activations to torch.float32. Although O1 mode is the typical choice, O2 shows a remarkable memory reduction and performance improvement (about 50% faster!). PyTorch 1.6 recently introduced native amp, but it only supports an O1-like mode, so this project still needs apex.amp rather than PyTorch's amp. We are working on implementing the automatic half-precision conversion without the apex.amp library (and it is actually simple to implement!).
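For reference, here is a minimal sketch of how the opt_level is selected with amp.initialize; the linear layer stands in for the GPT-2 model, and the hyperparameters are illustrative only:

```python
import torch
from apex import amp

# A toy stand-in for the GPT-2 model; any nn.Module works the same way.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# O1 patches operations to run in mixed precision (the typical choice);
# O2 casts the model weights to float16, which is where the memory
# reduction and speedup mentioned above come from.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

x = torch.randn(8, 1024, device="cuda")
loss = model(x).float().mean()

# Loss scaling is required under amp to avoid float16 gradient underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```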

Fused optimizers in apex.optimizers are considerably faster

apex contains not only mixed precision, but also some fused optimizers and layers. PyTorch ships many optimizers (and even fused versions of some), but we found them slow for language models. Empirically, apex.optimizers.FusedAdam is much faster for Transformer-based models.
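As a sketch, FusedAdam is nearly a drop-in replacement for torch.optim.Adam, assuming apex was installed with its CUDA extensions; the hyperparameters here are placeholders:

```python
import torch
from apex.optimizers import FusedAdam

model = torch.nn.Linear(1024, 1024).cuda()

# FusedAdam fuses the element-wise Adam update into a single CUDA kernel,
# which avoids launching one small kernel per parameter tensor.
optimizer = FusedAdam(model.parameters(), lr=1e-4,
                      betas=(0.9, 0.999), weight_decay=0.01)
```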

The fused layer norm is faster for large hidden_dims

As mentioned above, apex contains a fused layer norm layer designed for tensors with large hidden_dims. Typically, torch.nn.LayerNorm is faster than apex.normalization.FusedLayerNorm, but in Transformer-based models the input tensors have large hidden_dims, which leads to worse performance with torch.nn.LayerNorm. In our simple experiments, we found FusedLayerNorm is up to 50% faster than PyTorch's implementation.
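For illustration, swapping in the fused layer norm is a one-line change; the hidden size of 768 below is GPT-2 small's and is used only as an example:

```python
import torch
from apex.normalization import FusedLayerNorm

hidden_dim = 768  # GPT-2 small's hidden size, as an example

# Drop-in replacement for torch.nn.LayerNorm; the fused CUDA kernel
# pays off when hidden_dim is large, as in Transformer models.
layer_norm = FusedLayerNorm(hidden_dim).cuda()

x = torch.randn(8, 128, hidden_dim, device="cuda")
y = layer_norm(x)
```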

Consequently, we chose apex for the above empirical reasons. Note that we tested these on PyTorch 1.5 and 1.6.

@affjljoo3581 Thank you very much for your detailed and useful reply!