This is the official repository for "MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models".
- Our MMLU-SR dataset is designed to challenge the true comprehension and reasoning abilities of LLMs through symbol replacement.
- Our evaluations of `gpt-3.5-turbo`, `gemini-1.0-pro`, and `llama3-8b` showed significantly lower performance on MMLU-SR than on the original MMLU, highlighting the value of MMLU-SR.
- [10/2024] Our MMLU-SR is evaluated on `gpt-4o` and shows consistently lower performance.
- [9/2024] Our MMLU-SR is accepted by the EMNLP Workshop GenBench 2024!
- [8/2024] Our MMLU-SR is evaluated on `gemini-1.5-pro`, `llama3-70b`, and `gpt-4o-mini`, showing lower performance on MMLU-SR compared to the original MMLU. Check our new experiment results!
- [7/2024] Our MMLU-SR is merged into lm-eval-tasks!
- [6/2024] Project page set up at Papers with Code, with initial leaderboards for the three MMLU-SR variants: `Question Only`, `Answer Only`, and `Question and Answer`.
The setup depends on which LLMs you are testing. To reproduce our experiments with `gpt-3.5-turbo`, `gemini-1.0-pro`, and `llama3-8b`, an `environment.yml` file is provided as a reference. First, install conda on your system. Then run the following (assuming Linux; we have already found that Mac users run into package installation conflicts):

```
conda env create -f environment.yml
conda activate mmlusr
```
We created a human evaluation questionnaire that draws 50 problems from MMLU/MMLU-SR. If you are interested, please take some time to complete it. We really appreciate it :)
Our datasets can be found in the `dataset` folder, on Google Drive, and on Hugging Face.
To evaluate our dataset with GPT or Gemini on a specific task, run:

```
python3 evaluate.py --task question_only --model_name your_model
```

You can set `--task` to `question_only`, `answer_only`, or `question_and_answer`.
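If you want to run all three variants back to back, here is a minimal sketch that simply shells out to the `evaluate.py` CLI shown above; `your_model` is a placeholder to replace with the model you are testing.

```python
# Minimal sketch: run all three MMLU-SR variants in sequence via the evaluate.py CLI above.
# "your_model" is a placeholder; substitute e.g. gpt-3.5-turbo or gemini-1.0-pro.
import subprocess

for task in ["question_only", "answer_only", "question_and_answer"]:
    subprocess.run(
        ["python3", "evaluate.py", "--task", task, "--model_name", "your_model"],
        check=True,
    )
```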
Once you have the output JSON files, you can use `categories.py` to view the grouped results:

```
python3 categories.py
```
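If you would rather aggregate the results yourself, the sketch below illustrates the kind of grouping `categories.py` performs; the output file name, the JSON layout, and the (partial) subject-to-category map are illustrative assumptions, not the script's actual implementation.

```python
# Illustrative sketch: group per-subject accuracies into the four MMLU categories.
# The file name, the {"subject": accuracy} layout, and the partial category map
# below are assumptions for demonstration only; see categories.py for the real logic.
import json
from collections import defaultdict

CATEGORY_MAP = {  # partial map, for illustration
    "abstract_algebra": "STEM",
    "philosophy": "Humanities",
    "econometrics": "Social Sciences",
    "clinical_knowledge": "Other",
}

with open("question_only_results.json") as f:  # hypothetical output file
    per_subject = json.load(f)

grouped = defaultdict(list)
for subject, acc in per_subject.items():
    grouped[CATEGORY_MAP.get(subject, "Other")].append(acc)

for category, accs in grouped.items():
    print(f"{category}: {sum(accs) / len(accs):.3f}")
```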
For Llama3, look into the `lm-evaluation-harness` folder and follow the instructions there.
The model evaluation outputs can be downloaded from this Google Drive link.
`lm-evaluation-harness` will automatically categorize all subject results in its JSON output file.
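For convenience, here is a hedged sketch for printing those grouped numbers from Python; the exact output file location and the layout under the `results` key depend on the harness version, so treat both as assumptions.

```python
# Sketch: print per-task and per-group metrics from an lm-eval output JSON file.
# The file location under --output_path and the "results" key layout vary by version.
import glob
import json

path = sorted(glob.glob("your_output_path/**/results*.json", recursive=True))[-1]
with open(path) as f:
    output = json.load(f)

for task, metrics in output["results"].items():
    print(task, metrics)
```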
We also provide a Hugging Face dataset for users who want to use other frameworks such as lm-evaluation-harness. To clone the entire dataset:

```
git clone https://huggingface.co/datasets/NiniCat/MMLU-SR
```

To load a specific task (check the dataset configurations to see the available tasks):
```python
from datasets import load_dataset

dataset = load_dataset("NiniCat/MMLU-SR", "answer_only_abstract_algebra")
```
You can check the train/test split by:

```python
train_dataset = dataset["train"]
test_dataset = dataset["test"]
print(f"Number of training examples: {len(train_dataset)}")
print(f"Number of test examples: {len(test_dataset)}")
```
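To discover all available configurations (one per task/subject combination) and peek at a single example, something like the following works with the `datasets` library; the exact fields of each example are easiest to check by printing one.

```python
# List every MMLU-SR configuration on the Hub and inspect one test example.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("NiniCat/MMLU-SR")
print(len(configs), "configurations, e.g.", configs[:3])

dataset = load_dataset("NiniCat/MMLU-SR", "answer_only_abstract_algebra")
print(dataset["test"][0])  # print one example to see its fields
```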
To run lm-eval below with Hugging Face models, you will first need to log in with your Hugging Face access token:

```
huggingface-cli login
```

Clone and install `lm-evaluation-harness` as follows:

```
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```
Run the desired groups as follows. For example, to reproduce the experiment results from the paper using LLaMA3 8B on the MMLU-SR Question-and-Answer dataset:

```
lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B,parallelize=True --tasks mmlusr --batch_size 2 --output_path 'your path'
```

You can change the model simply by changing `--model_args`; check `lm_eval -h` for argument help and the lm-eval documentation for more instructions.
You can also switch to other dataset groups or evaluate a single task; see the information below:

- `mmlusr`: MMLU variant where the terminology in both the questions and the answers is modified.
- `mmlusr_answer_only`: MMLU variant where the terminology in the answers is modified.
- `mmlusr_question_only`: MMLU variant where the terminology in the questions is modified.

There are 57 symbol-replaced subjects in each group. You can run a single task:

`mmlusr_question_only_abstract_algebra`

or a category:

`mmlusr_question_only_stem_tasks`

or a subset group:

`mmlusr_question_only`
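If you prefer to drive the harness from Python rather than the CLI, recent lm-eval versions expose a `simple_evaluate` helper; below is a minimal sketch for a single subject, mirroring the CLI example above (adjust `model_args` to your model).

```python
# Minimal sketch: evaluate one MMLU-SR subject through lm-evaluation-harness's Python API.
# Assumes a recent lm-eval version installed as shown above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B",
    tasks=["mmlusr_question_only_abstract_algebra"],
    num_fewshot=5,
    batch_size=2,
)
print(results["results"]["mmlusr_question_only_abstract_algebra"])
```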
Our experiments on `gpt-3.5-turbo`, `gpt-4o-mini`, `gpt-4o`, `gemini-1.0-pro`, `gemini-1.5-pro`, `llama3-8b`, and `llama3-70b` are summarized in the table below:
Model/Dataset | Humanities | Social Sciences | STEM | Other | Average | Avg Drop |
---|---|---|---|---|---|---|
GPT-3.5-turbo | ||||||
MMLU (5-shot) | 0.723 | 0.770 | 0.554 | 0.714 | 0.677 | ----- |
Question Only (5-shot) | 0.661 | 0.702 | 0.506 | 0.641 | 0.616 | 9.08% |
Answer Only (5-shot) | 0.540 | 0.595 | 0.441 | 0.538 | 0.520 | 23.27% |
Q&A (5-shot) | 0.469 | 0.523 | 0.396 | 0.476 | 0.459 | 32.26% |
GPT-4o-mini | ||||||
MMLU (5-shot) | 0.793 | 0.858 | 0.689 | 0.782 | 0.771 | ----- |
Question Only (5-shot) | 0.744 | 0.792 | 0.621 | 0.724 | 0.710 | 7.91% |
Answer Only (5-shot) | 0.659 | 0.738 | 0.602 | 0.651 | 0.655 | 15.05% |
Q&A (5-shot) | 0.588 | 0.666 | 0.531 | 0.585 | 0.585 | 24.12% |
GPT-4o | ||||||
MMLU (5-shot) | 0.880 | 0.906 | 0.771 | 0.854 | 0.845 | ----- |
Question Only (5-shot) | 0.838 | 0.856 | 0.702 | 0.811 | 0.792 | 6.27% |
Answer Only (5-shot) | 0.764 | 0.824 | 0.705 | 0.760 | 0.757 | 10.41% |
Q&A (5-shot) | 0.708 | 0.754 | 0.635 | 0.712 | 0.695 | 17.75% |
Gemini-1.0-pro | ||||||
MMLU (5-shot) | 0.728 | 0.758 | 0.596 | 0.703 | 0.686 | ----- |
Question Only (5-shot) | 0.687 | 0.744 | 0.539 | 0.658 | 0.645 | 5.86% |
Answer Only (5-shot) | 0.619 | 0.670 | 0.504 | 0.591 | 0.586 | 14.48% |
Q&A (5-shot) | 0.582 | 0.622 | 0.472 | 0.544 | 0.546 | 20.85% |
Gemini-1.5-pro | ||||||
MMLU (5-shot) | 0.849 | 0.881 | 0.802 | 0.815 | 0.832 | ----- |
Question Only (5-shot) | 0.795 | 0.836 | 0.700 | 0.754 | 0.764 | 8.17% |
Answer Only (5-shot) | 0.741 | 0.816 | 0.747 | 0.739 | 0.758 | 8.89% |
Q&A (5-shot) | 0.690 | 0.752 | 0.670 | 0.681 | 0.694 | 16.59% |
Llama3-8B | ||||||
MMLU (5-shot) | 0.593 | 0.757 | 0.557 | 0.729 | 0.651 | ----- |
Question Only (5-shot) | 0.546 | 0.685 | 0.507 | 0.668 | 0.595 | 8.69% |
Answer Only (5-shot) | 0.455 | 0.599 | 0.460 | 0.557 | 0.510 | 21.28% |
Q&A (5-shot) | 0.421 | 0.538 | 0.424 | 0.499 | 0.465 | 28.63% |
Llama3-70B | ||||||
MMLU (5-shot) | 0.681 | 0.868 | 0.697 | 0.814 | 0.765 | ----- |
Question Only (5-shot) | 0.635 | 0.812 | 0.631 | 0.770 | 0.712 | 6.93% |
Answer Only (5-shot) | 0.539 | 0.683 | 0.565 | 0.622 | 0.602 | 21.31% |
Q&A (5-shot) | 0.523 | 0.653 | 0.536 | 0.591 | 0.576 | 24.71% |
If you use this dataset in your work, please cite it as follows:
```bibtex
@misc{wang2024mmlusrbenchmarkstresstestingreasoning,
      title={MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models},
      author={Wentian Wang and Sarthak Jain and Paul Kantor and Jacob Feldman and Lazaros Gallos and Hao Wang},
      year={2024},
      eprint={2406.15468},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.15468},
}
```