Can't reproduce AGIEval result of LISA
Closed this issue · 1 comment
BaohaoLiao commented
Hi,
I have a question about the AGIEval (3-shot) numbers in Table 2. Which version of lm_eval did you use for AGIEval?
I ran v0.4.4 in the 0-shot setting on the pretrained Llama-2-7b, and the result is much better than the reported number:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| agieval | 0 | none | | acc | ↑ | 0.2906 | ± | 0.0048 |
| - agieval_aqua_rat | 1 | none | 0 | acc | ↑ | 0.2598 | ± | 0.0276 |
| | | none | 0 | acc_norm | ↑ | 0.2756 | ± | 0.0281 |
| - agieval_gaokao_biology | 1 | none | 0 | acc | ↑ | 0.2381 | ± | 0.0295 |
| | | none | 0 | acc_norm | ↑ | 0.2905 | ± | 0.0314 |
| - agieval_gaokao_chemistry | 1 | none | 0 | acc | ↑ | 0.2367 | ± | 0.0296 |
| | | none | 0 | acc_norm | ↑ | 0.2560 | ± | 0.0304 |
| - agieval_gaokao_chinese | 1 | none | 0 | acc | ↑ | 0.2805 | ± | 0.0287 |
| | | none | 0 | acc_norm | ↑ | 0.2805 | ± | 0.0287 |
| - agieval_gaokao_english | 1 | none | 0 | acc | ↑ | 0.3595 | ± | 0.0275 |
| | | none | 0 | acc_norm | ↑ | 0.2974 | ± | 0.0262 |
| - agieval_gaokao_geography | 1 | none | 0 | acc | ↑ | 0.2412 | ± | 0.0304 |
| | | none | 0 | acc_norm | ↑ | 0.2764 | ± | 0.0318 |
| - agieval_gaokao_history | 1 | none | 0 | acc | ↑ | 0.2936 | ± | 0.0298 |
| | | none | 0 | acc_norm | ↑ | 0.2340 | ± | 0.0277 |
| - agieval_gaokao_mathcloze | 1 | none | 0 | acc | ↑ | 0.0424 | ± | 0.0186 |
| - agieval_gaokao_mathqa | 1 | none | 0 | acc | ↑ | 0.2593 | ± | 0.0234 |
| | | none | 0 | acc_norm | ↑ | 0.2650 | ± | 0.0236 |
| - agieval_gaokao_physics | 1 | none | 0 | acc | ↑ | 0.3200 | ± | 0.0331 |
| | | none | 0 | acc_norm | ↑ | 0.3400 | ± | 0.0336 |
| - agieval_jec_qa_ca | 1 | none | 0 | acc | ↑ | 0.4454 | ± | 0.0157 |
| | | none | 0 | acc_norm | ↑ | 0.4464 | ± | 0.0157 |
| - agieval_jec_qa_kd | 1 | none | 0 | acc | ↑ | 0.4880 | ± | 0.0158 |
| | | none | 0 | acc_norm | ↑ | 0.4920 | ± | 0.0158 |
| - agieval_logiqa_en | 1 | none | 0 | acc | ↑ | 0.2473 | ± | 0.0169 |
| | | none | 0 | acc_norm | ↑ | 0.2980 | ± | 0.0179 |
| - agieval_logiqa_zh | 1 | none | 0 | acc | ↑ | 0.2565 | ± | 0.0171 |
| | | none | 0 | acc_norm | ↑ | 0.3149 | ± | 0.0182 |
| - agieval_lsat_ar | 1 | none | 0 | acc | ↑ | 0.2391 | ± | 0.0282 |
| | | none | 0 | acc_norm | ↑ | 0.2000 | ± | 0.0264 |
| - agieval_lsat_lr | 1 | none | 0 | acc | ↑ | 0.2431 | ± | 0.0190 |
| | | none | 0 | acc_norm | ↑ | 0.2235 | ± | 0.0185 |
| - agieval_lsat_rc | 1 | none | 0 | acc | ↑ | 0.2565 | ± | 0.0267 |
| | | none | 0 | acc_norm | ↑ | 0.2268 | ± | 0.0256 |
| - agieval_math | 1 | none | 0 | acc | ↑ | 0.0780 | ± | 0.0085 |
| - agieval_sat_en | 1 | none | 0 | acc | ↑ | 0.3495 | ± | 0.0333 |
| | | none | 0 | acc_norm | ↑ | 0.2427 | ± | 0.0299 |
| - agieval_sat_en_without_passage | 1 | none | 0 | acc | ↑ | 0.3350 | ± | 0.0330 |
| | | none | 0 | acc_norm | ↑ | 0.2087 | ± | 0.0284 |
| - agieval_sat_math | 1 | none | 0 | acc | ↑ | 0.2455 | ± | 0.0291 |
| | | none | 0 | acc_norm | ↑ | 0.2182 | ± | 0.0279 |

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| agieval | 0 | none | | acc | ↑ | 0.2906 | ± | 0.0048 |
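
For comparison, here is a minimal sketch of how the 0-shot run above and a 3-shot run might be launched with the lm_eval CLI. The Hugging Face model ID `meta-llama/Llama-2-7b-hf`, the batch size, and the helper name `build_lm_eval_command` are assumptions for illustration and are not taken from this issue. The only intended difference between the two commands is `--num_fewshot`.

```python
import shlex
from typing import List


def build_lm_eval_command(
    num_fewshot: int,
    model_id: str = "meta-llama/Llama-2-7b-hf",  # assumed model ID
    tasks: str = "agieval",
) -> List[str]:
    """Build an lm_eval command line (hypothetical helper, for illustration)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", tasks,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", "8",  # assumed; adjust to the available GPU memory
    ]


# The setting used for the tables above, and the setting named in Table 2.
zero_shot = build_lm_eval_command(0)
three_shot = build_lm_eval_command(3)

# Print shell-ready commands; run them manually or via subprocess.run(...).
print(shlex.join(zero_shot))
print(shlex.join(three_shot))
```

Running both commands with the same lm_eval version would help separate the effect of the shot count from any difference between harness versions.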
Dominic789654 commented