# llm-benchmark

A list of comprehensive LLM evaluation frameworks. Contributions welcome!

| Benchmark | Release Date | Repository | Paper/Blog | # Datasets | Aspect | Licence |
|---|---|---|---|---|---|---|
| HELM | --- | https://github.com/stanford-crfm/helm | Holistic Evaluation of Language Models | 42 | --- | --- |
| BIG-bench | --- | https://github.com/google/BIG-bench | Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models | 214 | --- | --- |
| BigBIO | --- | https://github.com/bigscience-workshop/biomedical | BigBio: A Framework for Data-Centric Biomedical Natural Language Processing | 126 | --- | --- |
| BigScience Evaluation | --- | https://github.com/bigscience-workshop/evaluation | --- | 28 | --- | --- |
| Language Model Evaluation Harness | --- | https://github.com/EleutherAI/lm-evaluation-harness | Evaluating Large Language Models (LLMs) with Eleuther AI | 56 | --- | --- |
| Scholar Evals | --- | https://github.com/scholar-org/scholar-evals | --- | --- | --- | --- |
| Code Generation LM Evaluation Harness | --- | https://github.com/bigcode-project/bigcode-evaluation-harness | --- | 13 | --- | --- |
| Chatbot Arena | --- | https://github.com/lm-sys/FastChat | --- | --- | --- | --- |
| GLUE | --- | https://github.com/nyu-mll/jiant | --- | 11 | --- | --- |
| SuperGLUE | --- | https://github.com/nyu-mll/jiant | --- | 10 | --- | --- |
| CLUE | --- | https://github.com/CLUEbenchmark/CLUE | --- | 9 | --- | --- |
| CodeXGLUE | --- | https://github.com/microsoft/CodeXGLUE | --- | 10 | --- | --- |
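
As a quick illustration of how the harness-style frameworks in this table are typically driven, the sketch below runs a single task through EleutherAI's Language Model Evaluation Harness. It assumes the `lm-eval` package (v0.4 or later) is installed via pip; the model (`EleutherAI/pythia-160m`) and task (`hellaswag`) are illustrative choices, and argument and result-key names may differ between versions.

```python
# Minimal sketch: evaluating one benchmark task with EleutherAI's
# lm-evaluation-harness. Assumes `pip install lm-eval` (v0.4+); the model
# and task below are illustrative, not an endorsement of a particular setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF causal LM
    tasks=["hellaswag"],                             # one task from the harness
    num_fewshot=0,
    batch_size=8,
)

# `results["results"]` maps each task name to its metric dictionary.
print(results["results"]["hellaswag"])
```

Most of the other harnesses listed above (e.g. the bigcode-evaluation-harness) follow a similar pattern: pick a model backend, pick one or more tasks, and collect a per-task metrics report.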