llm_benchmarks

A collection of benchmarks and datasets for evaluating LLMs.

Knowledge and Language Understanding

Massive Multitask Language Understanding (MMLU)

  • Description: Measures general knowledge across 57 different subjects, ranging from STEM to social sciences.
  • Purpose: To assess the LLM's understanding and reasoning in a wide range of subject areas.
  • Relevance: Ideal for multifaceted AI systems that require extensive world knowledge and problem solving ability.
  • Source: Measuring Massive Multitask Language Understanding
  • Resources:
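
Because MMLU spans many subjects, results are often broken down per subject before being averaged. The snippet below is a minimal sketch of that bookkeeping, assuming predictions are already available as `(subject, predicted_choice, gold_choice)` tuples; the record layout, the function names, and the choice of an unweighted (macro) average are illustrative assumptions, not the official evaluation script.

```python
from collections import defaultdict


def per_subject_accuracy(records):
    """Return {subject: accuracy} for (subject, predicted, gold) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)
    return {subject: correct[subject] / total[subject] for subject in total}


def macro_average(accuracy_by_subject):
    """Unweighted mean over subjects (one possible aggregation)."""
    return sum(accuracy_by_subject.values()) / len(accuracy_by_subject)


# Toy records; the subject names and answers are made up for illustration.
records = [
    ("astronomy", "A", "A"),
    ("astronomy", "B", "C"),
    ("world_history", "D", "D"),
]
accuracy = per_subject_accuracy(records)
print(accuracy)
print(macro_average(accuracy))
```

A macro average weights every subject equally regardless of how many questions it has; averaging over all questions instead would favor the larger subjects.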

AI2 Reasoning Challenge (ARC)

General Language Understanding Evaluation (GLUE)

Natural Questions

LAnguage Modelling Broadened to Account for Discourse Aspects (LAMBADA)

HellaSwag

  • Description: Tests natural language inference by requiring LLMs to complete passages in a way that requires understanding intricate details.
  • Purpose: To evaluate the model's ability to generate contextually appropriate text continuations.
  • Relevance: Useful in content creation, dialogue systems, and applications requiring advanced text generation capabilities.
  • Source: HellaSwag: Can a Machine Really Finish Your Sentence?
  • Resources:

Multi-Genre Natural Language Inference (MultiNLI)

SuperGLUE

TriviaQA

WinoGrande

SciQ

  • Description: Consists of multiple-choice questions mainly in natural sciences like physics, chemistry, and biology.
  • Purpose: To test the ability to answer science-based questions, often with additional supporting text.
  • Relevance: Useful for educational tools, especially in science education and knowledge testing platforms.
  • Source: Crowdsourcing Multiple Choice Science Questions
  • Resources:

Reasoning Capabilities

GSM8K

  • Description: A set of 8.5K grade-school math problems that require basic to intermediate math operations.
  • Purpose: To test LLMs’ ability to work through multistep math problems.
  • Relevance: Useful for assessing AI’s capability in solving basic mathematical problems, valuable in educational contexts.
  • Source: Training Verifiers to Solve Math Word Problems
  • Resources:
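
Scoring GSM8K usually comes down to comparing one final number. Reference solutions in the dataset end with a line of the form `#### <answer>`. The sketch below extracts that value and compares it with the last number found in a model's output. The "last number" heuristic and all function names here are assumptions for illustration; real evaluation harnesses may parse model answers differently.

```python
import re

# A number with optional sign, thousands separators, and decimal part.
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def extract_reference_answer(solution: str) -> str:
    """Pull the final answer that follows '####' in a reference solution."""
    match = re.search(r"####\s*(" + NUMBER + ")", solution)
    if match is None:
        raise ValueError("no '#### <answer>' marker found")
    return match.group(1).replace(",", "")


def extract_model_answer(output: str):
    """Heuristic (an assumption): treat the last number as the answer."""
    numbers = re.findall(NUMBER, output)
    return numbers[-1].replace(",", "") if numbers else None


def is_correct(model_output: str, reference_solution: str) -> bool:
    predicted = extract_model_answer(model_output)
    if predicted is None:
        return False
    return float(predicted) == float(extract_reference_answer(reference_solution))


reference = "She sells 9 eggs at $2 each, so 9 * 2 = 18.\n#### 18"
print(is_correct("She earns 9 * 2 = 18 dollars.", reference))
```

Comparing as floats makes `18` and `18.0` count as the same answer; stricter string matching is another reasonable choice.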

Discrete Reasoning Over Paragraphs (DROP)

  • Description: An adversarially created reading comprehension benchmark requiring models to navigate through references and execute operations like addition or sorting.
  • Purpose: To evaluate the ability of models to understand complex texts and perform discrete operations.
  • Relevance: Useful in advanced educational tools and text analysis systems requiring logical reasoning.
  • Source: DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
  • Resources:

Counterfactual Reasoning Assessment (CRASS)

Large-scale ReAding Comprehension Dataset From Examinations (RACE)

  • Description: A set of reading comprehension questions derived from English exams given to Chinese students.
  • Purpose: To test LLMs' understanding of complex reading material and their ability to answer examination-level questions.
  • Relevance: Useful in language learning applications and educational systems for exam preparation.
  • Source: RACE: Large-scale ReAding Comprehension Dataset From Examinations
  • Resources:

Big-Bench Hard (BBH)

AGIEval

BoolQ

Multi-Turn Open-Ended Conversations

MT-bench

Question Answering in Context (QuAC)

  • Description: Features 14,000 dialogues with 100,000 question-answer pairs, simulating student-teacher interactions.
  • Purpose: To challenge LLMs with context-dependent, sometimes unanswerable questions within dialogues.
  • Relevance: Useful for conversational AI, educational software, and context-aware information systems.
  • Source: QuAC: Question Answering in Context
  • Resources:

Grounding and Abstractive Summarization

Ambient Clinical Intelligence Benchmark (ACI-BENCH)

MAchine Reading COmprehension Dataset (MS-MARCO)

  • Description: A large-scale collection of natural language questions and answers derived from real web queries.
  • Purpose: To test the ability of models to accurately understand and respond to real-world queries.
  • Relevance: Crucial for search engines, question-answering systems, and other consumer-facing AI applications.
  • Source: MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
  • Resources:

Query-based Multi-domain Meeting Summarization (QMSum)

  • Description: A benchmark for summarizing relevant spans of meetings in response to specific queries.
  • Purpose: To evaluate the ability of models to extract and summarize important information from meeting content.
  • Relevance: Useful for business intelligence tools, meeting analysis applications, and automated summarization systems.
  • Source: QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization
  • Resources:

Physical Interaction: Question Answering (PIQA)

  • Description: Tests knowledge and understanding of the physical world through hypothetical scenarios and solutions.
  • Purpose: To measure the model’s capability in handling physical interaction scenarios.
  • Relevance: Important for AI applications in robotics, physical simulations, and practical problem-solving systems.
  • Source: PIQA: Reasoning about Physical Commonsense in Natural Language
  • Resources:

Content Moderation and Narrative Control

ToxiGen

Helpfulness, Honesty, Harmlessness (HHH)

TruthfulQA

  • Description: A benchmark for evaluating the truthfulness of LLMs in generating answers to questions prone to false beliefs and biases.
  • Purpose: To test the ability of models to provide accurate and unbiased information.
  • Relevance: Important for AI systems where delivering accurate and unbiased information is critical, such as in educational or advisory roles.
  • Source: TruthfulQA: Measuring How Models Mimic Human Falsehoods
  • Resources:

Responsible AI (RAI)

Coding Capabilities

CodeXGLUE

HumanEval

  • Description: Contains programming challenges for evaluating LLMs' ability to write functional code based on instructions.
  • Purpose: To test the generation of correct and efficient code from given requirements.
  • Relevance: Important for automated code generation tools, programming assistants, and coding education platforms.
  • Source: Evaluating Large Language Models Trained on Code
  • Resources:
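
Functional correctness on HumanEval is typically reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. The source paper gives an unbiased estimator based on generating n samples per problem and counting the c that pass: pass@k = 1 - C(n-c, k) / C(n, k). The following is a minimal sketch of that formula for a single problem; the function name and the sample counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that pass the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example counts are made up: 200 samples, 20 of which pass.
for k in (1, 10, 100):
    print(k, pass_at_k(n=200, c=20, k=k))
```

The benchmark-level score is the mean of this value over all problems. For k = 1 the estimator reduces to c / n, the fraction of passing samples.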

Mostly Basic Python Programming (MBPP)

  • Description: Includes around 1,000 crowd-sourced Python programming problems designed to be solvable by entry-level programmers.
  • Purpose: To evaluate proficiency in solving basic programming tasks and understanding of Python.
  • Relevance: Useful for beginner-level coding education, automated code generation, and entry-level programming testing.
  • Source: Program Synthesis with Large Language Models
  • Resources:

LLM-Assisted Evaluation

LLM Judge

  • Source: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
  • Abstract: Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at this https URL.
  • Insights:
    • Use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge. MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants. To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
  • Resources:
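
The abstract above names position bias as a limitation of LLM judges. One mitigation is to query the judge twice with the answer order swapped and only accept a verdict that survives the swap. The sketch below assumes a hypothetical `judge(question, answer_a, answer_b)` callable returning `"A"`, `"B"`, or `"tie"`; it stands in for a real LLM call and is not part of any particular library.

```python
from typing import Callable

# Hypothetical judge signature: (question, answer_a, answer_b) -> "A" | "B" | "tie"
Judge = Callable[[str, str, str], str]


def judge_with_swap(judge: Judge, question: str, answer_1: str, answer_2: str) -> str:
    """Return 'answer_1', 'answer_2', or 'tie' after judging both orderings."""
    first = judge(question, answer_1, answer_2)   # answer_1 shown in slot A
    second = judge(question, answer_2, answer_1)  # answer_2 shown in slot A
    # Translate the second verdict back into the first ordering.
    second_unswapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    if first == second_unswapped and first != "tie":
        return "answer_1" if first == "A" else "answer_2"
    return "tie"  # inconsistent or tied verdicts are treated as a tie


# Toy judges standing in for an LLM call.
def always_first(question: str, a: str, b: str) -> str:
    return "A"  # maximally position-biased


def prefers_longer(question: str, a: str, b: str) -> str:
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


print(judge_with_swap(always_first, "q", "short", "a longer answer"))
print(judge_with_swap(prefers_longer, "q", "short", "a longer answer"))
```

Treating inconsistent verdicts as ties is a conservative choice: it doubles the number of judge calls but keeps a purely position-driven preference from being counted as a win.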

LLM-Eval

  • Source: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models
  • Abstract: We propose LLM-Eval, a unified multi-dimensional automatic evaluation method for open-domain conversations with large language models (LLMs). Existing evaluation methods often rely on human annotations, ground-truth responses, or multiple LLM prompts, which can be expensive and time-consuming. To address these issues, we design a single prompt-based evaluation method that leverages a unified evaluation schema to cover multiple dimensions of conversation quality in a single model call. We extensively evaluate the performance of LLM-Eval on various benchmark datasets, demonstrating its effectiveness, efficiency, and adaptability compared to state-of-the-art evaluation methods. Our analysis also highlights the importance of choosing suitable LLMs and decoding strategies for accurate evaluation results. LLM-Eval offers a versatile and robust solution for evaluating open-domain conversation systems, streamlining the evaluation process and providing consistent performance across diverse scenarios.
  • Insights:
    • Top-shelf LLMs (e.g., GPT-4, Claude) correlate better with human scores than metric-based evaluation measures.

JudgeLM

  • Source: JudgeLM: Fine-tuned Large Language Models are Scalable Judges
  • Abstract: Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics can not measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales from 7B, 13B, to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases in fine-tuning LLM as a judge and consider them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM obtains the state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. Our JudgeLM is efficient and the JudgeLM-7B only needs 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an agreement exceeding 90% that even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities in being judges of the single answer, multimodal models, multiple answers, and multi-turn chat.
  • Insights:
    • Relatively small models (e.g., 7B models) can be fine-tuned to be reliable judges of other models.

Prometheus

  • Source: Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
  • Abstract: Recently, using a powerful proprietary Large Language Model (LLM) (e.g., GPT-4) as an evaluator for long-form responses has become the de facto standard. However, for practitioners with large-scale evaluation tasks and custom criteria in consideration (e.g., child-readability), using proprietary LLMs as an evaluator is unreliable due to the closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose Prometheus, a fully open-source LLM that is on par with GPT-4's evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied. We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the Feedback Collection, we train Prometheus, a 13B evaluator LLM that can assess any given long-form text based on customized score rubric provided by the user. Experimental results show that Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882), and greatly outperforms ChatGPT (0.392). Furthermore, measuring correlation with GPT-4 with 1222 customized score rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval) shows similar trends, bolstering Prometheus's capability as an evaluator LLM. Lastly, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-sourced reward models explicitly trained on human preference datasets, highlighting its potential as an universal reward model.
  • Insights:
    • A scoring rubric and a reference answer vastly improve correlation with human scores.
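
The agreement figures quoted in the Prometheus abstract are Pearson correlations between evaluator scores and human scores. As a minimal sketch of how such a number is obtained, the function below computes the Pearson coefficient for two equal-length score lists. The function name and the example scores are made up for illustration and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / sqrt(var_x * var_y)


# Hypothetical 1-5 rubric scores for the same five responses.
human_scores = [5, 3, 4, 1, 2]
judge_scores = [4, 3, 5, 1, 2]
print(pearson(human_scores, judge_scores))
```

A value near 1 means the evaluator ranks and spaces responses much like the human raters do. It does not by itself show that the absolute scores match.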

Industry Resources

  • Latent Space - Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge
    • Summary:
      • The Open LLM Leaderboard, maintained by Clémentine Fourrier, is a standardized and reproducible way to evaluate language models' performance.
      • The leaderboard initially gained popularity in summer 2023 and has had over 2 million unique visitors and 300,000 active community members.
      • The recent update to the leaderboard (v2) includes six benchmarks to address model overfitting and to provide more room for improved performance.
      • LLMs are not recommended as judges due to issues like mode collapse and positional bias.
      • If LLMs must be used as judges, open LLMs like Prometheus or JudgeLM are suggested for reproducibility.
      • The LMSYS Chatbot Arena is another platform for AI engineers, but its rankings are not reproducible and may not accurately reflect model capabilities.