can't reproduce result
Closed this issue · 8 comments
Hello, I used the following configuration for the slice operation:
```
python run_slicegpt_perplexity.py \
--model microsoft/phi-2 \
--model-path .../phi-2 \
--final-orientation pca \
--cal-nsamples 1024 \
--cal-max-seqlen 2048 \
--save-dir ./test \
--cal-dataset alpaca \
--sparsity 0.3 \
--cal-batch-size 4 \
--no-wandb \
--device cuda
```
However, the average accuracy (avg per.) I obtained was only 55, while the result reported in the paper is 63. I am using lm-eval version 0.4.0, which should match the version the project depends on (I have tried both the PCA and random orientations).
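For reference, this is how I double-checked which lm_eval is actually installed in the evaluation environment (just a sketch; it assumes the distribution is published under the name lm_eval or lm-eval, adjust if your install differs):

```python
# Sketch: print the installed lm_eval version so it can be compared
# against the version pinned by the project.
from importlib.metadata import version, PackageNotFoundError

for name in ("lm_eval", "lm-eval"):
    try:
        print(name, version(name))
        break
    except PackageNotFoundError:
        continue
```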
That's odd - can you report the individual task accuracies? Are you able to reproduce this test result when slicing on wikitext2 at 20%?
I suspect this might be an issue with the version of lm_eval. The results from the 0.4 series are generally lower. I switched to version 0.3.0 (used by the Hugging Face Open Leaderboard) and obtained the following results:
method | params (B) | ppl | piqa | wg | hs | arc-e | arc-c | avg per.
---|---|---|---|---|---|---|---|---
llama2-7b | 6.7 | – | 0.772 | 0.6709 | 0.7291 | 0.5345 | 0.4078 | 0.6228
phi-2-2.7b | 2.7 | – | 0.7943 | 0.7609 | 0.736 | 0.7854 | 0.5418 | 0.7236
slicegpt-llama2-0.3(alpaca) | 5.29 | 3.2465 | 0.7116 | 0.5943 | 0.5377 | 0.5316 | 0.3601 | 0.547
slicegpt-phi-0.3(alpaca) | 2.09 | 3.3796 | 0.7443 | 0.6212 | 0.5342 | 0.3865 | 0.6713 | 0.5914
(the orientation is random; wg = WinoGrande, hs = HellaSwag; ppl is left blank for the dense baselines)
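To be explicit about how I read the avg per. column: it appears to be the plain mean of the five task accuracies. A minimal sketch, using only the values from the slicegpt-phi row above:

```python
# Sketch: "avg per." as the unweighted mean of the five task accuracies.
# Values are copied from the slicegpt-phi-0.3(alpaca) row above.
task_accs = {
    "piqa": 0.7443,
    "wg": 0.6212,
    "hs": 0.5342,
    "arc-e": 0.3865,
    "arc-c": 0.6713,
}
avg = sum(task_accs.values()) / len(task_accs)
print(f"avg per. = {avg:.4f}")  # 0.5915, matching the table's 0.5914 up to rounding
```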
Although the results for phi-2 are close to those in the paper, the results for llama2-7b are much worse. Moreover, the results after slicing also seem less than ideal.
By the way, test_experiments.py passes on my setup.
Hi @nailimixaM, I experimented with different settings and finally found that the issue lies with the batch size and the alpaca dataset.
The provided lm_eval only gives correct evaluation results when bs=1 (I suspect it's a bug in lm_eval).
Additionally, there might be an issue with the alpaca dataset, as I can only reproduce the correct results on wikitext2.
I hope you can investigate the problem.
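To show what I mean by the bs=1 effect, here is a rough sketch of how one could compare accuracies at batch size 1 vs 4 with the lm_eval 0.3.0 evaluator API on the plain HF checkpoint. This is illustration only: the sliced model needs the repo's own loading code, and the model/task identifiers here are assumptions rather than the repo's exact eval path.

```python
# Sketch: run the same tasks at batch size 1 and 4 and compare accuracies.
# Assumes lm_eval==0.3.0; the model string and tasks are illustrative.
from lm_eval import evaluator

TASKS = ["piqa", "winogrande", "hellaswag", "arc_easy", "arc_challenge"]

for bs in (1, 4):
    results = evaluator.simple_evaluate(
        model="hf-causal",
        model_args="pretrained=microsoft/phi-2",
        tasks=TASKS,
        batch_size=bs,
        no_cache=True,
    )
    accs = {t: results["results"][t]["acc"] for t in TASKS}
    print(f"batch_size={bs}: {accs}")
```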
I think I know the reason why the results cannot be reproduced. The average performance of the phi-2-sparsity-30 (alpaca) without RFT, listed in Appendix A.5, is not correct...
> The provided lm_eval only gives correct evaluation results when bs=1 (I suspect it's a bug in lm_eval).
For the paper I ran these experiments with batch size > 1; some noise is expected between different batch sizes, but the results should be largely the same. Can you confirm you're using the lm_eval version pinned in our .toml file?
@nailimixaM, yes, I used the lm_eval version you provided. However, I can now reproduce the correct results.
@nailimixaM so if I need to reduce the model's parameters by about 20%, I should use a slicing ratio of 30%? That is what I understand from the numbers @MrGGLS posted, where 30% slicing corresponds to roughly a 20% parameter reduction. Correct me if I misunderstood something.
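A quick sanity check using the parameter counts from the table earlier in this thread (illustration only; the exact reduction depends on the model architecture and on which weights scale with the sliced dimension):

```python
# Sketch: relate the 30% slicing ratio to the effective parameter reduction,
# using the params (B) values reported above in this thread.
models = {
    "llama2-7b": (6.7, 5.29),   # (dense params, sliced params) in billions at 30% slicing
    "phi-2-2.7b": (2.7, 2.09),
}
for name, (dense, sliced) in models.items():
    reduction = 1 - sliced / dense
    print(f"{name}: 30% slicing -> ~{reduction:.0%} fewer parameters")
# llama2-7b: ~21% fewer parameters; phi-2: ~23% fewer, i.e. roughly the
# ~20% reduction discussed above (slicing 30% removes less than 30% of the
# parameters, presumably because not every weight matrix shrinks with the
# sliced dimension).
```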
Related to this issue: #165