Calculating Importance of 'param_mix'
kiucho opened this issue · 2 comments
Hello. First of all, thank you for sharing this great research.
I have a question about calculating the importance of parameters.
In class TaylorImportance in hf_llama_pruner.py, line 274:
- Could you please tell me why the mixed-order importance is calculated as follows:
salience = salience - 0.5 * layer.weight * layer.weight.acc_grad * layer.weight
(rather than as the sum of the 1st- and 2nd-order terms)?
- Are the higher-order terms neglected?
Hi kiucho,
- For Question 1:
The derivation of this is (e.g., for Eq. 5 in the paper):

ΔL = L(w) − L(w = 0) ≈ (∂L/∂w) · w − ½ · w · H · w

where H is the Hessian, approximated here by the diagonal Fisher information, H ≈ E[(∂L/∂w)²], which is what acc_grad accumulates in the code.

And thus, in the code, the second-order (Hessian) term is subtracted from the first-order term.
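The derivation above can be sketched in a few lines. This is a minimal illustration, not the repository's exact code: the function name, toy tensors, and the explicit grad argument are assumptions for the sketch, while the salience update line mirrors the one quoted from hf_llama_pruner.py.

```python
import torch

# Minimal sketch of the mixed-order importance: I = |g*w - 0.5 * w * H * w|,
# with the Hessian diagonal H approximated by accumulated squared gradients
# (the role played by `acc_grad` in the actual code).
def taylor_importance(weight, grad, acc_grad):
    salience = weight * grad                                # first-order term: g * w
    salience = salience - 0.5 * weight * acc_grad * weight  # minus 0.5 * w * H * w
    return salience.abs()

w = torch.tensor([1.0, -2.0])        # toy weights
g = torch.tensor([0.5, 0.25])        # toy gradients
h = torch.tensor([0.25, 0.0625])     # toy diagonal Hessian (Fisher) estimates
print(taylor_importance(w, g, h))    # tensor([0.3750, 0.6250])
```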
There was a mistake in the first version of our paper; please refer to our code. We have uploaded a new version of the paper to arXiv (I'm not sure when it will be released, but I expect it will be available within the next 24 hours).
- For Question 2:
Yes, we can neglect the higher-order terms because their impact is negligible: they are small in scale compared to the preceding terms. The main reason is that the first-order term always dominates, since the model is never fully converged on our calibration samples (as evidenced by the large loss observed during pruning).
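The dominance of the first-order term away from convergence can be checked on a toy 1-D loss. The loss function and the weight value below are illustrative assumptions, not from the paper:

```python
import torch

# Toy check (assumed 1-D loss): compare the exact loss change from zeroing a
# weight with its Taylor estimates. Away from the minimum, the first-order
# term g*w dominates, and adding the Hessian term tightens the estimate.
w = torch.tensor(0.8, requires_grad=True)
loss_fn = lambda x: (x - 3.0) ** 4      # minimum at 3.0, so w=0.8 is "unconverged"

loss = loss_fn(w)
g, = torch.autograd.grad(loss, w, create_graph=True)
h, = torch.autograd.grad(g, w)          # exact second derivative in 1-D

first  = (g * w).item()                 # first-order term: g * w
second = (0.5 * w * h * w).item()       # second-order term: 0.5 * w * H * w
exact  = (loss - loss_fn(torch.tensor(0.0))).item()  # L(w) - L(w=0)

print(first, first - second, exact)
# first-order alone ~ -34.1, mixed-order ~ -52.7, exact change ~ -57.6:
# |first| > |second|, and the mixed-order estimate is closer to the exact value.
```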
Thank you for your kind explanation. I checked the new version of your paper. Thanks once again.