LLM-Evals-Catalogue

This repository stems from our paper, “Cataloguing LLM Evaluations”, and serves as a living, collaborative catalogue of LLM evaluation frameworks, benchmarks and papers.

Cataloguing LLM Evaluations

The table below provides a comprehensive catalogue of the Large Language Model (LLM) evaluation frameworks, benchmarks and papers we've surveyed in our paper, "Cataloguing LLM Evaluations". It organizes them based on the taxonomy proposed in our paper.

LLM evaluation is advancing at a rapid pace, and collaboration with the broader community is essential to keeping this catalogue relevant and useful.

To that end, we invite submissions of LLM evaluation frameworks, benchmarks, and papers for inclusion in this catalogue.

Before you raise a PR for a new submission, please read our contribution guidelines. Submissions will be reviewed and integrated into the catalogue on a rolling basis.

For any inquiries, feel free to reach out to us at info@aiverify.sg.

Task/Attribute | Evaluation Framework/Benchmark/Paper | Testing Approach
1.1. Natural Language Understanding
Text classification
HELM
  • Miscellaneous text classification
Benchmarking
Big-bench
  • Emotional understanding
  • Intent recognition
  • Humor
Benchmarking
Hugging Face
  • Text classification
  • Token classification
  • Zero-shot classification
Benchmarking
Sentiment analysis
HELM
  • Sentiment analysis
Benchmarking
Evaluation Harness
  • GLUE
Benchmarking
Big-bench
  • Emotional understanding
Benchmarking
Toxicity detection
HELM
  • Toxicity detection
Benchmarking
Evaluation Harness
  • ToxiGen
Benchmarking
Big-bench
  • Toxicity
Benchmarking
Information retrieval
HELM
  • Information retrieval
Benchmarking
Sufficient information
Big-bench
  • Sufficient information
Benchmarking
FLASK
  • Metacognition
Benchmarking (with human and model scoring)
Natural language inference
Big-bench
  • Analytic entailment (specific task)
  • Formal fallacies and syllogisms with negation (specific task)
  • Entailed polarity (specific task)
Benchmarking
Evaluation Harness
  • GLUE
Benchmarking
General English understanding
HELM
  • Language
Benchmarking
Big-bench
  • Morphology
  • Grammar
  • Syntax
Benchmarking
Evaluation Harness
  • BLiMP
Benchmarking
Eval Gauntlet
  • Language Understanding
Benchmarking
1.2. Natural Language Generation
Summarization
HELM
  • Summarization
Benchmarking
Big-bench
  • Summarization
Benchmarking
Evaluation Harness
  • BLiMP
Benchmarking
Hugging Face
  • Summarization
Benchmarking
Question generation and answering
HELM
  • Question answering
Benchmarking
Big-bench
  • Contextual question answering
  • Reading comprehension
  • Question generation
Benchmarking
Evaluation Harness
  • CoQA
  • ARC
Benchmarking
FLASK
  • Logical correctness
  • Logical robustness
  • Logical efficiency
  • Comprehension
  • Completeness
Benchmarking (with human and model scoring)
Hugging Face
  • Question answering
Benchmarking
Eval Gauntlet
  • Reading comprehension
Benchmarking
Conversations and dialogue
MT-bench
Benchmarking (with human and model scoring)
Evaluation Harness
  • MuTual
Benchmarking
Hugging Face
  • Conversational
Benchmarking
Paraphrasing
Big-bench
  • Paraphrase
Benchmarking
Other response qualities
FLASK
  • Readability
  • Conciseness
  • Insightfulness
Benchmarking (with human and model scoring)
Big-bench
  • Creativity
Benchmarking
Putting GPT-3's Creativity to the (Alternative Uses) Test
Benchmarking (with human scoring)
Miscellaneous text generation
Hugging Face
  • Fill-mask
  • Text generation
Benchmarking
1.3. Reasoning
HELM
  • Reasoning
Benchmarking
Big-bench
  • Algorithms
  • Logical reasoning
  • Implicit reasoning
  • Mathematics
  • Arithmetic
  • Algebra
  • Mathematical proof
  • Fallacy
  • Negation
  • Computer code
  • Probabilistic reasoning
  • Social reasoning
  • Analogical reasoning
  • Multi-step
  • Understanding the World
Benchmarking
Evaluation Harness
  • PIQA, PROST - Physical reasoning
  • MC-TACO - Temporal reasoning
  • MathQA - Mathematical reasoning
  • LogiQA - Logical reasoning
  • SAT Analogy Questions - Similarity of semantic relations
  • DROP, MuTual - Multi-step reasoning
Benchmarking
Eval Gauntlet
  • Commonsense reasoning
  • Symbolic problem solving
  • Programming
Benchmarking
1.4. Knowledge and factuality
HELM
  • Knowledge
Benchmarking
Big-bench
  • Context Free Question Answering
Benchmarking
Evaluation Harness
  • HellaSwag, OpenBookQA - General commonsense knowledge
  • TruthfulQA - Factuality of knowledge
Benchmarking
FLASK
  • Background Knowledge
Benchmarking (with human and model scoring)
Eval Gauntlet
  • World Knowledge
Benchmarking
1.5. Effectiveness of tool use
HuggingGPT
Benchmarking (with human and model scoring)
TALM
Benchmarking
Toolformer
Benchmarking (with human scoring)
ToolLLM
Benchmarking (with model scoring)
1.6. Multilingualism
Big-bench
  • Low-resource language
  • Non-English
  • Translation
Benchmarking
Evaluation Harness
  • C-Eval (Chinese evaluation suite)
  • MGSM
  • Translation
Benchmarking
BELEBELE
Benchmarking
MASSIVE
Benchmarking
HELM
  • Language (Twitter AAE)
Benchmarking
Eval Gauntlet
  • Language Understanding
Benchmarking
1.7. Context length
Big-bench
  • Context length
Benchmarking
Evaluation Harness
  • SCROLLS
Benchmarking
2.1. Law
LegalBench
Benchmarking (with algorithmic and human scoring)
2.2. Medicine
Large Language Models Encode Clinical Knowledge
Benchmarking (with human scoring)
Towards Generalist Biomedical AI
Benchmarking (with human scoring)
2.3. Finance
BloombergGPT
Benchmarking
3.1. Toxicity generation
HELM
  • Toxicity
Benchmarking
DecodingTrust
  • Toxicity
Benchmarking
Red Teaming Language Models to Reduce Harms
Manual Red Teaming
Red Teaming Language Models with Language Models
Automated Red Teaming
3.2. Bias
Demographic representation
HELM
Benchmarking
Finding New Biases in Language Models with a Holistic Descriptor Dataset
Benchmarking
Stereotype bias
HELM
  • Bias
Benchmarking
DecodingTrust
  • Stereotype Bias
Benchmarking
Big-bench
  • Social bias
  • Racial bias
  • Gender bias
  • Religious bias
Benchmarking
Evaluation Harness
  • CrowS-Pairs
Benchmarking
Red Teaming Language Models to Reduce Harms
Manual Red Teaming
Fairness
DecodingTrust
  • Fairness
Benchmarking
Distributional bias
Red Teaming Language Models with Language Models
Automated Red Teaming
Representation of subjective opinions
Towards Measuring the Representation of Subjective Global Opinions in Language Models
Benchmarking
Political bias
From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models
Benchmarking
The Self-Perception and Political Biases of ChatGPT
Benchmarking
Capability fairness
HELM
  • Language (Twitter AAE)
Benchmarking
3.3. Machine ethics
DecodingTrust
  • Machine Ethics
Benchmarking
Evaluation Harness
  • ETHICS
Benchmarking
3.4. Psychological traits
Does GPT-3 Demonstrate Psychopathy?
Benchmarking
Estimating the Personality of White-Box Language Models
Benchmarking
The Self-Perception and Political Biases of ChatGPT
Benchmarking
3.5. Robustness
HELM
  • Robustness to contrast sets
Benchmarking
DecodingTrust
  • Out-of-Distribution Robustness
  • Adversarial Robustness
  • Robustness Against Adversarial Demonstrations
Benchmarking
Big-bench
  • Out-of-Distribution Robustness
Benchmarking
Susceptibility to Influence of Large Language Models
Benchmarking
3.6. Data governance
DecodingTrust
  • Privacy
Benchmarking
HELM
  • Memorization and copyright
Benchmarking
Red Teaming Language Models to Reduce Harms
Manual Red Teaming
Red Teaming Language Models with Language Models
Automated Red Teaming
An Evaluation on Large Language Model Outputs: Discourse and Memorization
Benchmarking (with human scoring)
4.1. Dangerous Capabilities
Offensive cyber capabilities
GPT-4 System Card
  • Cybersecurity
System Card
Weapons acquisition
GPT-4 System Card
  • Proliferation of Conventional and Unconventional Weapons
System Card
Self and situation awareness
Big-bench
  • Self-Awareness
Benchmarking
Autonomous replication / self-proliferation
ARC Evals
  • Autonomous replication
Manual Red Teaming
Persuasion and manipulation
HELM
  • Narrative Reiteration
  • Narrative Wedging
Benchmarking (with human scoring)
Big-bench
  • Convince Me (specific task)
Benchmarking
Co-Writing with Opinionated Language Models Affects Users' Views
Manual Red Teaming
5.1. Misinformation
HELM
  • Question answering
  • Summarization
Benchmarking
Big-bench
  • Truthfulness
Benchmarking
Red Teaming Language Models to Reduce Harms
Manual Red Teaming
5.2. Disinformation
HELM
  • Narrative Reiteration
  • Narrative Wedging
Benchmarking (with human scoring)
Big-bench
  • Convince Me (specific task)
Benchmarking
5.3. Information on harmful, immoral or illegal activity
Red Teaming Language Models to Reduce Harms
Manual Red Teaming
5.4. Adult content
Red Teaming Language Models to Reduce Harms
Manual Red Teaming