How to smooth rotation and permutation matrices
Opened this issue · 5 comments
In contrast to SmoothQuant, where the smoothing matrix $\Lambda$ is diagonal, the smoothing matrix $\Lambda R_1 P R_2$ of DuQuant is dense. How can we integrate this matrix into other layers (e.g., LayerNorm)?
If it cannot be merged, additional computational costs are incurred at inference time.
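For context, here is a minimal sketch (plain PyTorch, illustrative shapes and names, not code from either project) of why a diagonal smoothing matrix folds cleanly into LayerNorm's per-channel affine parameters, whereas a dense matrix like $\Lambda R_1 P R_2$ cannot be expressed that way:

```python
import torch

torch.manual_seed(0)
d = 8
x = torch.randn(4, d)

# A LayerNorm with learnable per-channel affine parameters gamma (weight) and beta (bias).
ln = torch.nn.LayerNorm(d)
ln.weight.data.uniform_(0.5, 1.5)
ln.bias.data.uniform_(-0.5, 0.5)

# --- Diagonal case (SmoothQuant-style): fold the per-channel factors into gamma/beta ---
s = torch.rand(d) + 0.5                     # per-channel smoothing factors
ref = ln(x) / s                             # smoothing applied after LayerNorm

ln_folded = torch.nn.LayerNorm(d)
ln_folded.weight.data = ln.weight.data / s  # gamma' = gamma / s
ln_folded.bias.data = ln.bias.data / s      # beta'  = beta  / s
assert torch.allclose(ref, ln_folded(x), atol=1e-5)   # identical output, zero runtime cost

# --- Dense case: mixes channels, so it cannot be rewritten as per-channel gamma'/beta' ---
M = torch.linalg.qr(torch.randn(d, d))[0]   # some dense (here orthogonal) matrix
ref_dense = ln(x) @ M                       # requires an actual matmul somewhere at runtime
```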
Thank you for your question. In DuQuant, the rotation matrix is block-wise, and we only store the channel IDs for the permutation matrix.
Our speed measurements indicate that DuQuant incurs only about a 9% extra cost compared to the RTN method, which is reasonable considering the performance benefits. Please refer to Section 4.2 and Appendix E.1.
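To make this concrete, here is a rough sketch of the two points above, with assumed shapes and names (the block size of 128 is illustrative, and this is not DuQuant's actual code):

```python
import torch

torch.manual_seed(0)
d, block = 4096, 128                  # hidden size and rotation block size (assumed values)
n_blocks = d // block
x = torch.randn(16, d)                # a batch of activations

# Block-diagonal rotation: one small orthogonal matrix per group of `block` channels,
# applied as a batched matmul instead of a single dense d x d matmul.
R_blocks = torch.linalg.qr(torch.randn(n_blocks, block, block))[0]

def block_rotate(x, R_blocks):
    b = x.shape[0]
    xb = x.view(b, n_blocks, block).transpose(0, 1)   # (n_blocks, b, block)
    yb = torch.bmm(xb, R_blocks)                      # per-block rotation
    return yb.transpose(0, 1).reshape(b, -1)

# The permutation is stored only as channel IDs; applying it is a cheap index gather.
perm = torch.randperm(d)
y = block_rotate(x, R_blocks)[:, perm]

# Cost intuition: the rotation costs d * block multiply-adds per token instead of d^2
# for a dense matrix, and the permutation adds no FLOPs at all.
```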
Same question here. It also seems the method is more similar to AffineQuant than to QuaRot.
Actually, we do not consider DuQuant to be similar to AffineQuant. As outlined in Section 2 of our paper, AffineQuant, an optimization-based method, encounters significant issues with loss explosion when managing massive outliers in the down_proj layers of FFN modules. Consequently, AffineQuant and OmniQuant omit learnable parameters for these layers.
In contrast, DuQuant excels in handling these outliers through rotation and permutation transformations. Unlike QuaRot, which uses Hadamard rotation to address outliers, DuQuant further refines the rotation matrix by leveraging prior knowledge of specific outlier channels.
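As a toy illustration of what "leveraging prior knowledge of specific outlier channels" can mean (a hedged sketch of the idea only, not DuQuant's actual construction): swap the known outlier channel to the front, then apply an orthogonal transform whose first row is uniform, so the outlier's magnitude is spread evenly across the block.

```python
import torch

torch.manual_seed(0)
d = 8
x = torch.randn(1024, d)
x[:, 3] *= 20.0                          # channel 3 carries massive outliers (toy data)

outlier = x.abs().max(dim=0).values.argmax().item()    # known outlier channel

swap = torch.eye(d)
swap[:, [0, outlier]] = swap[:, [outlier, 0]]          # permutation: outlier -> position 0

u = torch.full((d,), d ** -0.5)                        # uniform unit vector
v = torch.eye(d)[0] - u
v = v / v.norm()
H = torch.eye(d) - 2.0 * torch.outer(v, v)             # Householder: orthogonal, maps e_0 -> u

R = swap @ H                                           # outlier-aware rotation
print(x.abs().max().item(), (x @ R).abs().max().item())  # the maximum magnitude shrinks
```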
Thank you for your reply!
I have another question: for QuaRot, R1 and R2 can be absorbed into the weights. For DuQuant, based on the paper, could the inference speed be slower than QuaRot, since DuQuant has more online rotation matmul operations?
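For reference, the absorption argument in this question looks like the following in plain PyTorch (illustrative names, not QuaRot's or DuQuant's code):

```python
import torch

torch.manual_seed(0)
d = 8
x = torch.randn(4, d)
W = torch.randn(d, d)                          # weight of a Linear layer (y = x @ W.T)
R = torch.linalg.qr(torch.randn(d, d))[0]      # an orthogonal rotation

# Offline absorption: rotate the activations and fold R into the stored weight once.
# Because R is orthogonal, (x @ R) @ (W @ R).T == x @ W.T, so runtime cost is unchanged.
x_rot = x @ R
W_absorbed = W @ R
assert torch.allclose(x_rot @ W_absorbed.T, x @ W.T, atol=1e-5)

# Any transform that cannot be folded into an adjacent weight must instead be applied
# as an extra matmul or gather at inference time, which is where online overhead comes from.
```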
Hi, thanks for your further question!
We have conducted more speedup evaluations for the pre-filling and decoding stages, including a comparison with QuaRot. The results show that the additional computational cost is manageable and the speed is comparable to QuaRot. We plan to include a detailed analysis in the camera-ready version.