Is Apex useful for GPT-2?
Hi, is there a reduction in the size of the GPT-2 model when using Apex? And is the model's inference speed faster?
Hi! Thanks for your issue! We are currently trying to remove the dependency on `apex`, but it still provides some useful tricks for training.
`apex.amp` supports a full half-precision mode
When using automatic mixed precision in training, you can choose the amp level passed to `amp.initialize`. While `O1` mode patches the model to actual mixed precision, `O2` mode casts all parameters of the model to `torch.float16` without restoring activations to `torch.float32`. Although `O1` mode is used in typical cases, `O2` shows remarkable memory reduction and performance enhancement (about 50% faster!). PyTorch 1.6 recently introduced native amp, but it only supports an `O1`-like mode, so this project still needs `apex.amp` rather than PyTorch's `amp`. We are working on implementing the automatic half-precision conversion without the `apex.amp` library (and it is actually simple to implement!).
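For reference, here is a minimal sketch of how `apex.amp` is typically wired into a training step with the `O2` level. The model, optimizer, and hyperparameters below are illustrative placeholders, not values from this repository:

```python
import torch
import torch.nn as nn
from apex import amp  # requires NVIDIA apex to be installed

# Illustrative stand-in; any CUDA model works the same way.
model = nn.Linear(768, 768).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# opt_level="O1" patches ops to mixed precision;
# opt_level="O2" casts the parameters themselves to float16.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

x = torch.randn(16, 768).cuda()
loss = model(x).float().pow(2).mean()  # toy loss, computed in float32

# Scale the loss before backward to avoid float16 gradient underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```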
Fused optimizers in `apex.optimizers` are considerably faster
`apex` contains not only mixed precision but also some fused optimizers and layers. PyTorch has many optimizers (some even fused), but they are slow for language models. We empirically found that `apex.optimizers.FusedAdam` is much faster for Transformer-based models.
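As an illustration, `FusedAdam` is a drop-in replacement for `torch.optim.Adam`; the model and hyperparameters here are placeholders, not recommendations from this project:

```python
import torch.nn as nn
from apex.optimizers import FusedAdam  # requires NVIDIA apex

model = nn.Linear(768, 768).cuda()  # illustrative model

# Drop-in replacement for torch.optim.Adam; parameters must live on the GPU.
optimizer = FusedAdam(model.parameters(), lr=1e-4, weight_decay=0.01)
```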
Fused layer norm layer for large `hidden_dims`
As mentioned above, `apex` contains a fused layer norm layer intended for tensors with large `hidden_dims`. Typically, `torch.nn.LayerNorm` is faster than `apex.normalization.FusedLayerNorm`, but in Transformer-based models the input tensors have large `hidden_dims`, which degrades the performance of `torch.nn.LayerNorm`. In our simple experiments, we found `FusedLayerNorm` to be up to 50% faster than PyTorch's implementation.
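A minimal usage sketch; the dimensions are illustrative (GPT-2 variants use hidden sizes between 768 and 1600), and `FusedLayerNorm` takes the same `normalized_shape` argument as `torch.nn.LayerNorm`:

```python
import torch
from apex.normalization import FusedLayerNorm  # requires NVIDIA apex

hidden_dim = 1024  # illustrative hidden size
layer_norm = FusedLayerNorm(hidden_dim).cuda()

x = torch.randn(16, 128, hidden_dim).cuda()  # (batch, sequence, hidden)
y = layer_norm(x)  # same semantics as torch.nn.LayerNorm(hidden_dim)
```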
Consequently, we chose `apex` for the above empirical reasons. Note that we tested these on PyTorch 1.5 and 1.6.
@affjljoo3581 Thank you very much for your detailed and useful reply!