TIGER-AI-Lab/MMLU-Pro

The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]

Python · Apache-2.0

Issues

Add Tencent Hunyuan-Large
#43 opened a month ago by EwoutH
0
Add Claude 3.5 Haiku
#42 opened a month ago by EwoutH
0
Paper claims there are 10-choices but the test split has varying number of choices (anywhere from 3 to 10)
#24 opened a month ago by eldarkurtic
6
New Model | meta-llama/Llama-3.1-405B-Instruct
#41 opened a month ago by agm-eratosth
0
New Model | mistralai/Mistral-Large-Instruct-2407
#40 opened a month ago by agm-eratosth
0
CUDA error: no kernel image is available for execution on the device
#29 opened a month ago by jakethesnake1126
0
New model | Cohere Aya Expanse
#37 opened a month ago by NSbuilder
1
Add SmolLM2 1.7B
#38 opened a month ago by EwoutH
1
Which DeepSeek-Coder-V2?
#39 opened a month ago by billbradley
1
OpenAI o1-preview and o1-mini
#21 opened 3 months ago by EwoutH
3
New model | Yi - Lightning
#36 opened a month ago by NSbuilder
1
Add Mistral Small v24.09
#35 opened a month ago by EwoutH
2
Add Ministral 3B and 8B
#34 opened a month ago by EwoutH
1
Llama-3.1-nemotron-70b-instruct
#30 opened 2 months ago by NSbuilder
2
Add Gemini-1.5-Flash-002 and -Pro-002
#25 opened 2 months ago by EwoutH
2
What is the Arx-0.3 model?
#31 opened 2 months ago by DenisSergeevitch
1
regarding leaderboard submission
#26 opened 2 months ago by sorobedio
1
Add Qwen2.5 model family
#22 opened 2 months ago by EwoutH
4
Suggested minimum context length requirement?
#23 opened 2 months ago by ubergarm
2
Support for standard deviation
#11 opened 2 months ago by RodriMora
1
eval_results do not contain the actual answer, right?
#17 opened 3 months ago by emanuelevivoli
2
Variable length of "options"?
#14 opened 3 months ago by billbradley
1
Why dont use chat template for chat model?
#15 opened 3 months ago by eyuansu62
1
where is global_record_file="eval_results/eval_record_collection.csv"?
#18 opened 3 months ago by lianshan01
1
Questionable questions
#16 opened 3 months ago by billbradley
1
Possible to remove spam model result
#13 opened 4 months ago by mrconter1
1
Add Grok-2?
#12 opened 4 months ago by mrconter1
1
Request for Llama3.1 8B, 70B and 405B
#10 opened 4 months ago by RodriMora
5
Add Gemma 2 9B and 27B
#4 opened 5 months ago by carterprince
4
Potential coding errors in `evaluate_from_api.py`
#9 opened 4 months ago by sudanl
1
Regex pattern in extract_final function.
#7 opened 5 months ago by chigkim
11
Duplicates in test split
#6 opened 5 months ago by Pupy101
1
Different Setup for Different Models?
#5 opened 5 months ago by chigkim
6
Chat template for instruct models for local eval
#1 opened 6 months ago by gnalbandyan
1