microsoft/TransformerCompression

can't reproduce result

Closed this issue · 8 comments

Hello, I used the following configuration for the slice operation:

python run_slicegpt_perplexity.py \
    --model microsoft/phi-2 \
    --model-path .../phi-2 \
    --final-orientation pca \
    --cal-nsamples 1024 \
    --cal-max-seqlen 2048 \
    --save-dir ./test \
    --cal-dataset alpaca \
    --sparsity 0.3 \
    --cal-batch-size 4 \
    --no-wandb \
    --device cuda

However, the average performance (avg per.) I obtained was only 55, while the paper reports 63. I am using lm-eval version 0.4.0, which should match the version the project depends on. I have tried both PCA and random orientations.

That's odd - can you report the individual task accuracies? Are you able to reproduce this test result when slicing on wikitext2 at 20%?

I suspect this might be an issue with the version of lm_eval. The results from the 0.4 series are generally lower. I switched to version 0.3.0 (used by the Hugging Face Open LLM Leaderboard) and obtained the following results:

| method | params | ppl | piqa | wg | hs | arc-e | arc-c | avg per. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama2-7b | 6.7 | | 0.772 | 0.6709 | 0.7291 | 0.5345 | 0.4078 | 0.6228 |
| phi-2-2.7b | 2.7 | | 0.7943 | 0.7609 | 0.736 | 0.7854 | 0.5418 | 0.7236 |
| slicegpt-llama2-0.3(alpaca) | 5.29 | 3.2465 | 0.7116 | 0.5943 | 0.5377 | 0.5316 | 0.3601 | 0.547 |
| slicegpt-phi-0.3(alpaca) | 2.09 | 3.3796 | 0.7443 | 0.6212 | 0.5342 | 0.3865 | 0.6713 | 0.5914 |

(the orientation is random)
Although the results for phi-2 are close to those in the paper, the results for llama2-7b are much worse. Moreover, the results after slicing also seem less than ideal.
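As a quick sanity check on the table above, the "avg per." column appears to be the plain arithmetic mean of the five task accuracies (piqa, wg, hs, arc-e, arc-c). This is inferred from the numbers rather than stated in the thread. A minimal sketch, with the values copied from the table and the helper name `mean_accuracy` chosen here for illustration:

```python
# Task accuracies copied from the table above, in the order
# piqa, wg, hs, arc-e, arc-c, followed by the reported "avg per." value.
rows = {
    "llama2-7b": ([0.772, 0.6709, 0.7291, 0.5345, 0.4078], 0.6228),
    "phi-2-2.7b": ([0.7943, 0.7609, 0.736, 0.7854, 0.5418], 0.7236),
    "slicegpt-llama2-0.3(alpaca)": ([0.7116, 0.5943, 0.5377, 0.5316, 0.3601], 0.547),
    "slicegpt-phi-0.3(alpaca)": ([0.7443, 0.6212, 0.5342, 0.3865, 0.6713], 0.5914),
}


def mean_accuracy(accuracies):
    """Unweighted mean of per-task accuracies (hypothetical helper)."""
    return sum(accuracies) / len(accuracies)


for name, (accuracies, reported) in rows.items():
    computed = mean_accuracy(accuracies)
    # Each computed mean should agree with the reported value up to rounding.
    print(f"{name}: computed {computed:.4f}, reported {reported}")
```

If the computed and reported values agree to within rounding, differences from the paper come from the per-task scores themselves and not from how the average is formed.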

By the way, I passed test_experiments.py

Hi @nailimixaM, I tried different variables and finally found that the issue is with the batch size and the alpaca dataset.

The provided lm_eval only gives correct evaluation results when bs=1 (I suspect it's a bug in lm_eval).

Additionally, there might be an issue with the alpaca dataset, as I can only reproduce the correct results on wikitext2.

I hope you can investigate the problem. 😃

I think I know the reason why the results cannot be reproduced. The average performance of the phi-2-sparsity-30 (alpaca) without RFT, listed in Appendix A.5, is not correct...

@MrGGLS

> The provided lm_eval only gives correct evaluation results when bs=1 (I suspect it's a bug in lm_eval).

For the paper I ran these experiments with batch size > 1. Some noise is expected between different batch sizes, but the results should be largely the same. Can you confirm you're using the lm_eval version from our .toml file?
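One way to confirm which lm_eval version is actually installed in the active environment is to query the package metadata. This is a minimal sketch using only the standard library. The distribution name `lm_eval` is an assumption and may differ depending on how the package was installed, and `installed_version` is a hypothetical helper, not part of the project:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Assumption: the evaluation harness is installed under the name "lm_eval".
print("lm_eval:", installed_version("lm_eval"))
```

The printed version can then be compared by hand against the one pinned in the project's .toml file.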

@nailimixaM, yes, I used the lm_eval you provided 😂. However, I can get the correct results now.

@nailimixaM so if I need to reduce the model's parameter count by 20%, I need to use a slicing ratio of 30%? This is what I understand from the screenshots of @MrGGLS, where 30% slicing corresponds to roughly a 20% reduction in parameters. Correct me if I misunderstood something.
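The parameter reduction implied by the results table earlier in this thread can be worked out directly from the "params" column. A minimal sketch, assuming those values are parameter counts in billions (the unit is not stated in the thread) and using a hypothetical helper `param_reduction`:

```python
def param_reduction(dense_params, sliced_params):
    """Fraction of parameters removed relative to the dense model."""
    return 1.0 - sliced_params / dense_params


# "params" values from the table: dense model vs. model sliced at sparsity 0.3.
print(f"llama2-7b: {param_reduction(6.7, 5.29):.1%}")  # 21.0%
print(f"phi-2:     {param_reduction(2.7, 2.09):.1%}")  # 22.6%
```

For both models, the reported counts at a slicing ratio of 0.3 correspond to a reduction of a little over 20%, which is consistent with the reading in the comment above.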

Related to this issue: #165