Question about the paper
jameswu2014 opened this issue · 3 comments
Hi @jameswu2014, thank you so much for your interest in our work. Given that the matrix $\Lambda$ is diagonal, and that the nonlinear activation commonly used in modern large language models (LLMs) is SwiGLU, which applies an element-wise product, $F(XW^T\Lambda) = F(XW^T)\Lambda$ holds in this context. Here $W$ is the `up_proj` weight, $F(V) = G \odot V$, and $G$ is the output of `gate_proj`, i.e., $F(XW^T\Lambda) = G \odot (XW^T\Lambda) = \left(G \odot (XW^T)\right)\Lambda = F(XW^T)\Lambda$.
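A quick numerical sanity check of this commutation identity (a minimal sketch with random tensors; the shapes and the names `X`, `W`, `G`, `Lam` are placeholders matching the notation above, not code from the repo):

```python
import torch

torch.manual_seed(0)

X = torch.randn(4, 8)              # input activations
W = torch.randn(16, 8)             # up_proj weight (out_features x in_features)
G = torch.randn(4, 16)             # gate_proj output (after the activation)
Lam = torch.diag(torch.randn(16))  # diagonal matrix Λ

def F(V, G):
    # SwiGLU-style gating: element-wise product with the gate output
    return G * V

lhs = F(X @ W.T @ Lam, G)          # F(X W^T Λ)
rhs = F(X @ W.T, G) @ Lam          # F(X W^T) Λ

# True: a diagonal Λ scales columns, which commutes with element-wise gating
print(torch.allclose(lhs, rhs, atol=1e-4))
```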
Got it. First of all, thank you for your reply. But I still have a couple of questions:
1. Do you mean that $G$ (the output of `gate_proj`) does not need to be rotated at all, or that it is rotated online?
2. The SiLU op comes after `gate_proj`, so $G$ is in fp16 and SiLU's output is also fp16. Does that mean the `gate_proj` GEMM is int4 x int8 -> fp16? Is that right?
Hi @jameswu2014. To answer your questions:

- We do not rotate block intermediate activations, so the output of `gate_proj` is not rotated.
- All W4A8 GEMM kernels take in int4 weights and int8 activations, use INT8 tensor cores, and generate fp16 outputs.
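To make that data flow concrete, here is a minimal emulation of such a W4A8 GEMM in plain PyTorch. The shapes and the per-tensor scales are assumptions for illustration only; the actual kernels run fused on INT8 tensor cores with int32 accumulation rather than the fp32 emulation used here:

```python
import torch

torch.manual_seed(0)
M, K, N = 4, 64, 32  # hypothetical GEMM shape

# int8 activations and int4 weights (int4 stored here as int8 values in [-8, 7])
act_int8 = torch.randint(-128, 128, (M, K), dtype=torch.int8)
w_int4 = torch.randint(-8, 8, (N, K), dtype=torch.int8)

act_scale = 0.02  # assumed per-tensor activation scale
w_scale = 0.01    # assumed per-tensor weight scale

# Emulate the integer GEMM in fp32 (real kernels accumulate in int32 on
# INT8 tensor cores); the integer magnitudes here keep fp32 exact.
acc = act_int8.float() @ w_int4.float().t()

# Dequantize the accumulator and emit fp16 outputs, as the kernels do.
out_fp16 = (acc * (act_scale * w_scale)).to(torch.float16)
print(out_fp16.shape, out_fp16.dtype)  # torch.Size([4, 32]) torch.float16
```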