Calculating Importance of 'param_mix'
kiucho opened this issue · 2 comments
Hello. First of all, thank you for sharing this great research.
I have a question about calculating the importance of parameters.
In class TaylorImportance in hf_llama_pruner.py, line 274:
- Could you please tell me why the mixed-order importance is calculated as follows:
salience = salience - 0.5 * layer.weight * layer.weight.acc_grad * layer.weight
(rather than as the sum of the 1st- and 2nd-order terms)?
- Are the higher-order terms neglected?
Hi kiucho,
- For Question 1:
The derivation of this is (e.g., for Eq. 5 in the paper):

ΔL = L(w) − L(w = 0) ≈ (∂L/∂w) · w − ½ · w · H · w

where H is the Hessian, approximated here by the diagonal Fisher information, H ≈ E[(∂L/∂w)²], which is what acc_grad accumulates in the code.

And thus, in the code, the second-order (Hessian) term is subtracted from the first-order term.
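The derivation above can be sketched in a few lines. This is a minimal illustration, not the repository's exact code: the function name, toy tensors, and the explicit grad argument are assumptions for the sketch, while the salience update line mirrors the one quoted from hf_llama_pruner.py.

```python
import torch

# Minimal sketch of the mixed-order importance: I = |g*w - 0.5 * w * H * w|,
# with the Hessian diagonal H approximated by accumulated squared gradients
# (the role played by `acc_grad` in the actual code).
def taylor_importance(weight, grad, acc_grad):
    salience = weight * grad                                # first-order term: g * w
    salience = salience - 0.5 * weight * acc_grad * weight  # minus 0.5 * w * H * w
    return salience.abs()

w = torch.tensor([1.0, -2.0])        # toy weights
g = torch.tensor([0.5, 0.25])        # toy gradients
h = torch.tensor([0.25, 0.0625])     # toy diagonal Hessian (Fisher) estimates
print(taylor_importance(w, g, h))    # tensor([0.3750, 0.6250])
```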
There was a mistake in the first version of our paper; please refer to our code. We have uploaded a new version of the paper to arXiv (I'm not sure when it will be released, but I expect it will be available within the next 24 hours).
- For Question 2:
Yes, we can neglect the higher-order terms because their impact is negligible: they are small in scale compared to the preceding terms. The main reason is that the first-order term always dominates, since the model is never fully converged on our calibration samples (as evidenced by the large loss observed during pruning).
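The dominance of the first-order term away from convergence can be checked on a toy 1-D loss. The loss function and the weight value below are illustrative assumptions, not from the paper:

```python
import torch

# Toy check (assumed 1-D loss): compare the exact loss change from zeroing a
# weight with its Taylor estimates. Away from the minimum, the first-order
# term g*w dominates, and adding the Hessian term tightens the estimate.
w = torch.tensor(0.8, requires_grad=True)
loss_fn = lambda x: (x - 3.0) ** 4      # minimum at 3.0, so w=0.8 is "unconverged"

loss = loss_fn(w)
g, = torch.autograd.grad(loss, w, create_graph=True)
h, = torch.autograd.grad(g, w)          # exact second derivative in 1-D

first  = (g * w).item()                 # first-order term: g * w
second = (0.5 * w * h * w).item()       # second-order term: 0.5 * w * H * w
exact  = (loss - loss_fn(torch.tensor(0.0))).item()  # L(w) - L(w=0)

print(first, first - second, exact)
# first-order alone ~ -34.1, mixed-order ~ -52.7, exact change ~ -57.6:
# |first| > |second|, and the mixed-order estimate is closer to the exact value.
```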
Thank you for your kind explanation. I checked the new version of your paper. Thanks once again.