TIGER-AI-Lab/MMLU-Pro
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]
Python · Apache-2.0
Issues
- 0
Add Tencent Hunyuan-Large
#43 opened by EwoutH
- 0
Add Claude 3.5 Haiku
#42 opened by EwoutH
- 6
Paper claims there are 10-choices but the test split has varying number of choices (anywhere from 3 to 10)
#24 opened by eldarkurtic
- 1
New model | Cohere Aya Expanse
#37 opened by NSbuilder
- 1
Add SmolLM2 1.7B
#38 opened by EwoutH
- 1
Which DeepSeek-Coder-V2?
#39 opened by billbradley
- 3
OpenAI o1-preview and o1-mini
#21 opened by EwoutH
- 1
New model | Yi - Lightning
#36 opened by NSbuilder
- 2
Add Mistral Small v24.09
#35 opened by EwoutH
- 1
Add Ministral 3B and 8B
#34 opened by EwoutH
- 2
Llama-3.1-nemotron-70b-instruct
#30 opened by NSbuilder
- 2
Add Gemini-1.5-Flash-002 and -Pro-002
#25 opened by EwoutH
- 1
What is the Arx-0.3 model?
#31 opened by DenisSergeevitch
- 1
regarding leaderboard submission
#26 opened by sorobedio
- 4
Add Qwen2.5 model family
#22 opened by EwoutH
- 2
Suggested minimum context length requirement?
#23 opened by ubergarm
- 1
Support for standard deviation
#11 opened by RodriMora
- 1
Variable length of "options"?
#14 opened by billbradley
- 1
Why dont use chat template for chat model?
#15 opened by eyuansu62
- 1
Questionable questions
#16 opened by billbradley
- 1
Possible to remove spam model result
#13 opened by mrconter1
- 1
Add Grok-2?
#12 opened by mrconter1
- 5
Request for Llama3.1 8B, 70B and 405B
#10 opened by RodriMora
- 4
Add Gemma 2 9B and 27B
#4 opened by carterprince
- 11
Regex pattern in extract_final function.
#7 opened by chigkim
- 1
Duplicates in test split
#6 opened by Pupy101
- 6
Different Setup for Different Models?
#5 opened by chigkim