Results don't match
Greetings,
I've done some experiments with PrunerZero and Wanda and saw that some results don't match the paper. Please find the obtained results below:
Method | BoolQ | RTE | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Mean |
---|---|---|---|---|---|---|---|
Wanda | 76.61% | 53.07% | 52.70% | 67.96% | 72.31% | 38.91% | 30.80% | 56.05% |
Wanda (norm) | 76.61% | 53.07% | 70.93% | 67.96% | 69.19% | 43.00% | 43.20% | 60.57% |
PrunerZero | 70.37% | 53.07% | 51.15% | 66.22% | 71.72% | 36.69% | 28.00% | 53.89% |
PrunerZero (norm) | 70.37% | 53.07% | 68.90% | 66.22% | 67.89% | 39.08% | 40.80% | 58.05% |
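For reference, the means in both rows can be reproduced with a short script (a sketch, not code from either repo; the "(norm)" rule here is "use acc_norm wherever lm-eval-harness reports it, acc otherwise", and the task names follow the harness):

```python
# Recompute each method's mean accuracy under a consistent rule:
# use acc_norm whenever the harness reports it, acc otherwise.
# Values are the (acc, acc_norm) pairs from the table above;
# None means the task has no normalized accuracy.
results = {
    "Wanda": {
        "boolq": (76.61, None), "rte": (53.07, None),
        "hellaswag": (52.70, 70.93), "winogrande": (67.96, None),
        "arc_easy": (72.31, 69.19), "arc_challenge": (38.91, 43.00),
        "openbookqa": (30.80, 43.20),
    },
    "Pruner-Zero": {
        "boolq": (70.37, None), "rte": (53.07, None),
        "hellaswag": (51.15, 68.90), "winogrande": (66.22, None),
        "arc_easy": (71.72, 67.89), "arc_challenge": (36.69, 39.08),
        "openbookqa": (28.00, 40.80),
    },
}

def mean_accuracy(tasks, use_norm):
    """Average accuracy; with use_norm=True, prefer acc_norm where it exists."""
    vals = [(norm if use_norm and norm is not None else acc)
            for acc, norm in tasks.values()]
    return sum(vals) / len(vals)

for method, tasks in results.items():
    print(f"{method}: acc={mean_accuracy(tasks, False):.2f}%, "
          f"acc_norm={mean_accuracy(tasks, True):.2f}%")
# Wanda: acc=56.05%, acc_norm=60.57%
# Pruner-Zero: acc=53.89%, acc_norm=58.05%
```

Whichever rule is chosen, the point is that it has to be applied to every method uniformly before the means are comparable.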
And the results from the paper:
I've put the results from tasks that have a normalized accuracy on a new line (red and purple in the screenshot). For tasks that don't have one (yellow in the screenshot), I simply repeated the accuracy.
Maybe there was a mistake and the normalized accuracy was reported only for PrunerZero. Could you check it?
Best regards!
Is there any update regarding this issue?
Hi, sorry for the late reply. We report the higher of the norm and non-norm results.
Basically, your results are the same as ours. Due to differences in CUDA version, GPU, and other hardware, some deviation is expected, which seems acceptable.
Here is my recipe:
- CUDA 12.0
- Python 3.9
- A6000
Best regards
I'm sorry, but this is clearly wrong. If you take the higher value for Pruner-Zero, you should apply the same rule to the other methods (like Wanda). As we can see, Wanda reaches a mean of 60.57% when using the normalized accuracy. Other works didn't use the normalized accuracy. Under a consistent rule, PrunerZero is better than magnitude pruning alone, but worse than SparseGPT and Wanda.
Thank you for pointing that out. I will recheck it in the coming days. Maybe using downstream-task performance as the fitness would be a better approach.