Azure/MS-AMP

Can I use fp8 only when the code runs to the fp8 branch?

forevergj opened this issue · 8 comments

I am using deepspeed.ops.adam FusedAdam with DeepSpeed ZeRO stage 2, and the code does not enter the fp8 branch.

wkcn commented

Hi @forevergj , thanks for your attention to our work!

The reason is that the master weight, weight, weight gradient, and optimizer states are all tensors with scaling factors, and FusedAdam does not support computation on scaling tensors.
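To illustrate why a fused optimizer kernel cannot consume scaling tensors, here is a minimal sketch (the class and function names are hypothetical, not MS-AMP's actual types): the stored payload only has meaning together with its scaling factor, so a kernel that reads the raw buffer as a plain tensor updates the wrong quantities.

```python
class ScalingTensor:
    """A low-bit payload plus a per-tensor scaling factor (illustrative)."""
    def __init__(self, values, scale):
        self.values = values   # quantized payload, e.g. integer codes for FP8
        self.scale = scale     # real value of element i is values[i] * scale

    def to_float(self):
        return [v * self.scale for v in self.values]

def plain_sgd_step(param_values, grad_values, lr):
    # A FusedAdam-style kernel sees only raw buffers and ignores any scale.
    return [p - lr * g for p, g in zip(param_values, grad_values)]

w = ScalingTensor([10, 20], scale=0.1)   # represents [1.0, 2.0]
g = ScalingTensor([5, 5], scale=0.01)    # represents [0.05, 0.05]

# Wrong: operating on raw payloads mixes incompatible scales.
wrong = plain_sgd_step(w.values, g.values, lr=1.0)
print(wrong)                              # [5.0, 15.0] -- meaningless

# Right: a scale-aware step dequantizes (or folds the scales in) first.
right = [p - 1.0 * gr for p, gr in zip(w.to_float(), g.to_float())]
print([round(v, 2) for v in right])       # [0.95, 1.95]
```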

Thank you for your reply! Where do the benefits of using MS-AMP for training acceleration come from?

wkcn commented

@forevergj
MS-AMP applies low-bit data formats to the master weight, weight, weight gradient, and optimizer states. This saves GPU memory, which allows a larger batch size and therefore faster training.
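To make the memory saving concrete, here is a rough bytes-per-parameter comparison. The exact layouts depend on the MS-AMP optimization level and the framework, so the numbers below are an illustrative assumption, not measured values.

```python
# Classic FP16 mixed precision with Adam:
#   FP32 master weight (4) + FP16 weight (2) + FP16 gradient (2)
#   + FP32 Adam first moment (4) + FP32 Adam second moment (4)
amp_bytes = 4 + 2 + 2 + 4 + 4          # 16 bytes per parameter

# An MS-AMP-style low-bit layout (assumed here: FP16 master weight,
# FP8 weight, FP8 gradient, FP8 + FP16 optimizer states):
msamp_bytes = 2 + 1 + 1 + 1 + 2        # 7 bytes per parameter

print(amp_bytes, msamp_bytes)          # 16 7
print(f"saving: {1 - msamp_bytes / amp_bytes:.0%}")
```

The freed memory is what allows the larger batch size mentioned above.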

1. Does this mean that MS-AMP's FP8 operations only occur after backpropagation, when the optimizer updates the weights, and that forward propagation does not involve FP8?
2. Is the main optimization focused on gradient-communication time?

wkcn commented

  1. Forward propagation and backward propagation both involve FP8.
  2. No. The acceleration comes from FP8 matrix multiplication in the linear layers. In addition, MS-AMP reduces GPU memory usage through low-precision data formats, which allows a larger batch size for further acceleration.
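The idea behind FP8 matrix multiplication with scaling factors can be sketched in pure Python (this simulates per-tensor scaling with integer codes standing in for FP8 values; the names and the quantization scheme are illustrative, not MS-AMP internals):

```python
FP8_MAX = 448.0  # max representable magnitude of the E4M3 format

def quantize(x, fp_max=FP8_MAX):
    """Return (quantized codes, scale) for a flat list of floats."""
    amax = max(abs(v) for v in x) or 1.0
    scale = amax / fp_max
    # round() stands in for the FP8 cast in this simulation.
    return [round(v / scale) for v in x], scale

def scaled_dot(xq, xs, wq, ws):
    """Dot product on low-precision codes, rescaled to real units."""
    acc = sum(a * b for a, b in zip(xq, wq))  # low-precision accumulate
    return acc * xs * ws                      # fold both scales back in

x = [0.5, -1.0, 2.0]
w = [1.5, 0.25, -0.5]
xq, xs = quantize(x)
wq, ws = quantize(w)
approx = scaled_dot(xq, xs, wq, ws)
exact = sum(a * b for a, b in zip(x, w))
print(round(exact, 4), round(approx, 4))   # -0.5 -0.4989
```

The real speedup comes from the FP8 tensor cores executing the accumulate step, while the per-tensor scales keep the values within FP8's narrow dynamic range.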

Is it similar to Transformer Engine? What is the difference between MS-AMP and Transformer Engine?

Does the figure in the MS-AMP paper mean that the conversion to FP8 is executed in two places, namely before the all-gather of gradients and before executing Linear? Does that mean the output of Linear is high precision?

wkcn commented

  1. MS-AMP can be combined with Transformer Engine. The difference is that MS-AMP applies the FP8 data format to the master weight, weight, weight gradient, optimizer states, and the communication in gradient reduction.

  2. Regarding your second question: the output of LayerNorm is a high-precision tensor (FP16 or BF16), which is converted to an FP8 tensor with a scaling factor and used as the input of the all-gather operation. The output of the all-gather is an FP8 tensor with a scaling factor, which is the input of the FP8 GEMM (YA). The output of the FP8 GEMM (YA) is a high-precision tensor (FP16 or BF16). Z is also a high-precision tensor, and it is then converted to an FP8 tensor with a scaling factor.
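That precision flow can be summarized as pseudocode (the function names are illustrative, not the MS-AMP API):

```
def linear_block(x):                       # x: FP16/BF16
    y = layer_norm(x)                      # FP16/BF16 (high precision)
    y_fp8, y_scale = cast_to_fp8(y)        # FP8 + scaling factor
    y_full = all_gather(y_fp8, y_scale)    # still FP8 + scaling factor
    z = fp8_gemm(y_full, weight_fp8)       # FP8 GEMM (YA) -> FP16/BF16 output
    z_fp8, z_scale = cast_to_fp8(z)        # re-cast for the next FP8 operation
    return z_fp8, z_scale
```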

Thank you for your reply. It's very clear.