mit-han-lab/qserve

Question about the paper

jameswu2014 opened this issue · 3 comments

[Screenshot of the relevant equation from the paper]
F is a nonlinear function, so why are the two sides equivalent?

Hi @jameswu2014, thank you so much for your interest in our work.
Given that the matrix $\Lambda$ is diagonal, and that the nonlinearity commonly used in modern large language models (LLMs) is SwiGLU, whose gating is an element-wise multiplication, $F(\mathbf{X}\mathbf{W}^T\Lambda) = F(\mathbf{X}\mathbf{W}^T)\Lambda$ holds in this context.
Here $\mathbf{W}$ is the up_proj weight, $F(\mathbf{V}) = \mathbf{G} \odot \mathbf{V}$, and $\mathbf{G}$ is the output of gate_proj. Since right-multiplication by the diagonal $\mathbf{\Lambda}$ only rescales columns, it commutes with the element-wise product, i.e., $F(\mathbf{X}\mathbf{W}^T\mathbf{\Lambda}) = \mathbf{G} \odot (\mathbf{X}\mathbf{W}^T\mathbf{\Lambda}) = (\mathbf{G} \odot (\mathbf{X}\mathbf{W}^T)) \mathbf{\Lambda} = F(\mathbf{X}\mathbf{W}^T)\mathbf{\Lambda}$.
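For intuition, here is a minimal numerical check (my own PyTorch sketch with made-up shapes, not code from this repo) that the element-wise SwiGLU-style gate commutes with right-multiplication by a diagonal $\Lambda$:

```python
# Sketch: verify G ⊙ (X W^T Λ) == (G ⊙ (X W^T)) Λ when Λ is diagonal.
import torch

torch.manual_seed(0)
B, D_in, D_ff = 4, 16, 32              # hypothetical batch / hidden / FFN sizes

X = torch.randn(B, D_in)               # input activations
W_up = torch.randn(D_ff, D_in)         # up_proj weights (W in the discussion)
W_gate = torch.randn(D_ff, D_in)       # gate_proj weights
Lam = torch.diag(torch.randn(D_ff))    # diagonal Λ

G = torch.nn.functional.silu(X @ W_gate.T)   # gate branch output (not rotated)

lhs = G * (X @ W_up.T @ Lam)           # F(X W^T Λ): scale columns, then gate
rhs = (G * (X @ W_up.T)) @ Lam         # F(X W^T) Λ: gate first, then scale

print(torch.allclose(lhs, rhs, atol=1e-5))   # True: the two orders agree
```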


Got it. First of all, thank you for your reply, but I still have a couple of questions:
1. Do you mean that G (the output of gate_proj) does not need to be rotated, or that it is rotated online?
2. The SiLU op comes after gate_proj, so G is in fp16 and SiLU's output is also fp16; the gate_proj GEMM is then int4 x int8 -> fp16. Is that right?

Hi @jameswu2014. For your questions:

  • We do not rotate block intermediate activations, so the output of gate_proj is not rotated.
  • All W4A8 GEMM kernels take in int4 weights and int8 activations, use INT8 tensor cores, and generate fp16 outputs.
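For illustration only, here is a rough dtype-flow sketch of the gated MLP under this scheme (my own sketch with a hypothetical `fake_w4a8_gemm` stand-in, not QServe's actual kernel API): each W4A8 GEMM consumes int8 activations and int4 weights and emits fp16, so SiLU and the element-wise gating operate on fp16 tensors.

```python
import torch

def fake_w4a8_gemm(x_int8, w_int4, x_scale, w_scale):
    # Stand-in for a W4A8 kernel: dequantize and matmul in fp32 for clarity,
    # then cast to fp16 to mimic the "int4 x int8 -> fp16" output dtype.
    x = x_int8.float() * x_scale
    w = w_int4.float() * w_scale
    return (x @ w.T).to(torch.float16)

B, D_in, D_ff = 4, 16, 32                                        # hypothetical shapes
x_int8 = torch.randint(-128, 128, (B, D_in), dtype=torch.int8)   # quantized activations
w_gate = torch.randint(-8, 8, (D_ff, D_in), dtype=torch.int8)    # 4-bit weight range
w_up   = torch.randint(-8, 8, (D_ff, D_in), dtype=torch.int8)

gate = fake_w4a8_gemm(x_int8, w_gate, 0.05, 0.1)   # fp16 gate_proj output (not rotated)
up   = fake_w4a8_gemm(x_int8, w_up,   0.05, 0.1)   # fp16 up_proj output

# SiLU and the element-wise multiply both stay in fp16 downstream (computed in
# fp32 here only so the snippet runs on any PyTorch CPU build, then cast back).
G = torch.nn.functional.silu(gate.float()).to(torch.float16)
hidden = G * up                                     # fp16 element-wise gate
print(hidden.dtype)                                 # torch.float16
```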