OptimalScale/LMFlow

Can't reproduce AGIEval result of LISA

Closed this issue · 1 comment

Hi,

I have a question about AGIEval (3-shot) in Table 2. Which version of lm_eval do you use for AGIEval?

I ran v0.4.4 with 0-shot on the pretrained version of Llama-2-7b, and the result is much better than the reported number:

| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| agieval | 0 | none | | acc | 0.2906 | ± | 0.0048 |
| - agieval_aqua_rat | 1 | none | 0 | acc | 0.2598 | ± | 0.0276 |
| | | none | 0 | acc_norm | 0.2756 | ± | 0.0281 |
| - agieval_gaokao_biology | 1 | none | 0 | acc | 0.2381 | ± | 0.0295 |
| | | none | 0 | acc_norm | 0.2905 | ± | 0.0314 |
| - agieval_gaokao_chemistry | 1 | none | 0 | acc | 0.2367 | ± | 0.0296 |
| | | none | 0 | acc_norm | 0.2560 | ± | 0.0304 |
| - agieval_gaokao_chinese | 1 | none | 0 | acc | 0.2805 | ± | 0.0287 |
| | | none | 0 | acc_norm | 0.2805 | ± | 0.0287 |
| - agieval_gaokao_english | 1 | none | 0 | acc | 0.3595 | ± | 0.0275 |
| | | none | 0 | acc_norm | 0.2974 | ± | 0.0262 |
| - agieval_gaokao_geography | 1 | none | 0 | acc | 0.2412 | ± | 0.0304 |
| | | none | 0 | acc_norm | 0.2764 | ± | 0.0318 |
| - agieval_gaokao_history | 1 | none | 0 | acc | 0.2936 | ± | 0.0298 |
| | | none | 0 | acc_norm | 0.2340 | ± | 0.0277 |
| - agieval_gaokao_mathcloze | 1 | none | 0 | acc | 0.0424 | ± | 0.0186 |
| - agieval_gaokao_mathqa | 1 | none | 0 | acc | 0.2593 | ± | 0.0234 |
| | | none | 0 | acc_norm | 0.2650 | ± | 0.0236 |
| - agieval_gaokao_physics | 1 | none | 0 | acc | 0.3200 | ± | 0.0331 |
| | | none | 0 | acc_norm | 0.3400 | ± | 0.0336 |
| - agieval_jec_qa_ca | 1 | none | 0 | acc | 0.4454 | ± | 0.0157 |
| | | none | 0 | acc_norm | 0.4464 | ± | 0.0157 |
| - agieval_jec_qa_kd | 1 | none | 0 | acc | 0.4880 | ± | 0.0158 |
| | | none | 0 | acc_norm | 0.4920 | ± | 0.0158 |
| - agieval_logiqa_en | 1 | none | 0 | acc | 0.2473 | ± | 0.0169 |
| | | none | 0 | acc_norm | 0.2980 | ± | 0.0179 |
| - agieval_logiqa_zh | 1 | none | 0 | acc | 0.2565 | ± | 0.0171 |
| | | none | 0 | acc_norm | 0.3149 | ± | 0.0182 |
| - agieval_lsat_ar | 1 | none | 0 | acc | 0.2391 | ± | 0.0282 |
| | | none | 0 | acc_norm | 0.2000 | ± | 0.0264 |
| - agieval_lsat_lr | 1 | none | 0 | acc | 0.2431 | ± | 0.0190 |
| | | none | 0 | acc_norm | 0.2235 | ± | 0.0185 |
| - agieval_lsat_rc | 1 | none | 0 | acc | 0.2565 | ± | 0.0267 |
| | | none | 0 | acc_norm | 0.2268 | ± | 0.0256 |
| - agieval_math | 1 | none | 0 | acc | 0.0780 | ± | 0.0085 |
| - agieval_sat_en | 1 | none | 0 | acc | 0.3495 | ± | 0.0333 |
| | | none | 0 | acc_norm | 0.2427 | ± | 0.0299 |
| - agieval_sat_en_without_passage | 1 | none | 0 | acc | 0.3350 | ± | 0.0330 |
| | | none | 0 | acc_norm | 0.2087 | ± | 0.0284 |
| - agieval_sat_math | 1 | none | 0 | acc | 0.2455 | ± | 0.0291 |
| | | none | 0 | acc_norm | 0.2182 | ± | 0.0279 |

| Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| agieval | 0 | none | | acc | 0.2906 | ± | 0.0048 |
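
When comparing the group score with a paper's number, it may help to check how the group `acc` relates to the subtask scores. The sketch below only redoes the arithmetic on the 21 subtask `acc` values listed above. The plain average comes out lower than the reported 0.2906, which would be consistent with subtasks being weighted by their number of examples. That weighting is an assumption here and is not confirmed anywhere in this thread.

```python
# Subtask `acc` values copied from the 0-shot results above.
subtask_acc = {
    "agieval_aqua_rat": 0.2598,
    "agieval_gaokao_biology": 0.2381,
    "agieval_gaokao_chemistry": 0.2367,
    "agieval_gaokao_chinese": 0.2805,
    "agieval_gaokao_english": 0.3595,
    "agieval_gaokao_geography": 0.2412,
    "agieval_gaokao_history": 0.2936,
    "agieval_gaokao_mathcloze": 0.0424,
    "agieval_gaokao_mathqa": 0.2593,
    "agieval_gaokao_physics": 0.3200,
    "agieval_jec_qa_ca": 0.4454,
    "agieval_jec_qa_kd": 0.4880,
    "agieval_logiqa_en": 0.2473,
    "agieval_logiqa_zh": 0.2565,
    "agieval_lsat_ar": 0.2391,
    "agieval_lsat_lr": 0.2431,
    "agieval_lsat_rc": 0.2565,
    "agieval_math": 0.0780,
    "agieval_sat_en": 0.3495,
    "agieval_sat_en_without_passage": 0.3350,
    "agieval_sat_math": 0.2455,
}

# Group-level `acc` as printed in the results above.
reported_group_acc = 0.2906

# Plain (unweighted) average over subtasks, every subtask counted once.
unweighted_mean = sum(subtask_acc.values()) / len(subtask_acc)

print(f"subtasks:        {len(subtask_acc)}")
print(f"unweighted mean: {unweighted_mean:.4f}")
print(f"reported group:  {reported_group_acc:.4f}")
```

If the number in Table 2 was averaged differently from the harness's group score, the two would not be directly comparable even at the same n-shot setting.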

Thanks for your interest in LMFlow and LISA! We used lm_eval version 0.4.2; our eval log is below.
[image: eval log]
Hope this information is helpful 😄
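
For anyone trying to line up their settings with the ones mentioned in this thread (lm_eval 0.4.2, 3-shot), here is a minimal sketch. It checks which harness version is installed and builds the matching command line, without running it. The model id `meta-llama/Llama-2-7b-hf`, the batch size, and the helper names `installed_lm_eval_version` and `build_agieval_command` are assumptions made for illustration. The exact command the LISA authors ran is not shown in this thread.

```python
import shlex
from importlib import metadata


def installed_lm_eval_version():
    """Return the installed lm_eval version string, or None if it is not installed."""
    try:
        # Assumes the distribution is registered under the name "lm_eval".
        return metadata.version("lm_eval")
    except metadata.PackageNotFoundError:
        return None


def build_agieval_command(pretrained, num_fewshot=3, batch_size=8):
    """Build the argv list for an lm_eval CLI run of the agieval group.

    The defaults are illustrative: 3-shot matches the Table 2 caption quoted
    in the question; the batch size is an arbitrary choice.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", "agieval",
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
    ]


if __name__ == "__main__":
    expected = "0.4.2"  # version stated in the maintainers' reply
    found = installed_lm_eval_version()
    if found != expected:
        print(f"lm_eval version is {found!r}, thread reports {expected!r}")

    # Hypothetical model id; substitute the checkpoint you actually evaluate.
    cmd = build_agieval_command("meta-llama/Llama-2-7b-hf")
    print(shlex.join(cmd))
```

The command is printed instead of executed so the settings can be checked first. Running it 0-shot on v0.4.4, as in the question, changes two variables at once compared with the reported setup.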