affjljoo3581/GPT2

Is Apex useful for GPT-2?


Hi, is there a reduction in the size of the GPT-2 model when using Apex? And is the inference speed of the model faster?

Hi! Thanks for your issue! We are currently trying to remove the dependency on apex, but it still provides some useful tricks for training.

apex.amp supports a full half-precision mode

When using automatic mixed precision in training, you can choose the amp level passed to amp.initialize. While O1 mode patches the model to perform actual mixed precision, O2 mode casts all parameters of the model to torch.float16 without restoring activations to torch.float32. Although O1 mode is the typical choice, O2 shows a remarkable memory reduction and performance improvement (about 50% faster!). PyTorch 1.6 recently introduced native amp, but it only supports an O1-like mode, so this project still needs apex.amp rather than PyTorch's amp. We are working on implementing the automatic half-precision conversion without the apex.amp library (and it is actually simple to implement!).
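For reference, here is a minimal sketch of how the opt_level is selected with amp.initialize; the linear layer stands in for the GPT-2 model, and the hyperparameters are illustrative only:

```python
import torch
from apex import amp

# A toy stand-in for the GPT-2 model; any nn.Module works the same way.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# O1 patches operations to run in mixed precision (the typical choice);
# O2 casts the model weights to float16, which is where the memory
# reduction and speedup mentioned above come from.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

x = torch.randn(8, 1024, device="cuda")
loss = model(x).float().mean()

# Loss scaling is required under amp to avoid float16 gradient underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```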

Fused optimizers in apex.optimizers are considerably faster

apex contains not only mixed precision, but also some fused optimizers and layers. PyTorch ships many optimizers (and even fused versions of some), but we found them slow for language models. Empirically, apex.optimizers.FusedAdam is much faster for Transformer-based models.
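As a sketch, FusedAdam is nearly a drop-in replacement for torch.optim.Adam, assuming apex was installed with its CUDA extensions; the hyperparameters here are placeholders:

```python
import torch
from apex.optimizers import FusedAdam

model = torch.nn.Linear(1024, 1024).cuda()

# FusedAdam fuses the element-wise Adam update into a single CUDA kernel,
# which avoids launching one small kernel per parameter tensor.
optimizer = FusedAdam(model.parameters(), lr=1e-4,
                      betas=(0.9, 0.999), weight_decay=0.01)
```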

The fused layer norm is faster for large hidden_dims

As mentioned above, apex contains a fused layer norm layer designed for tensors with large hidden_dims. Typically, torch.nn.LayerNorm is faster than apex.normalization.FusedLayerNorm, but in Transformer-based models the input tensors have large hidden_dims, which leads to worse performance with torch.nn.LayerNorm. In our simple experiments, we found FusedLayerNorm is up to 50% faster than PyTorch's implementation.
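For illustration, swapping in the fused layer norm is a one-line change; the hidden size of 768 below is GPT-2 small's and is used only as an example:

```python
import torch
from apex.normalization import FusedLayerNorm

hidden_dim = 768  # GPT-2 small's hidden size, as an example

# Drop-in replacement for torch.nn.LayerNorm; the fused CUDA kernel
# pays off when hidden_dim is large, as in Transformer models.
layer_norm = FusedLayerNorm(hidden_dim).cuda()

x = torch.randn(8, 128, hidden_dim, device="cuda")
y = layer_norm(x)
```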

Consequently, we chose apex for the above empirical reasons. Note that we tested these on PyTorch 1.5 and 1.6.

@affjljoo3581 Thank you very much for your detailed and useful reply!