# llm-benchmark

A list of comprehensive LLM evaluation frameworks. Contributions welcome!

| Benchmark | Release Date | Repository | Paper/Blog | Number of Datasets | Aspect | Licence |
| --- | --- | --- | --- | --- | --- | --- |
| HELM | --- | https://github.com/stanford-crfm/helm | Holistic Evaluation of Language Models | 42 | --- | --- |
| BIG-bench | --- | https://github.com/google/BIG-bench | Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models | 214 | --- | --- |
| BigBIO | --- | https://github.com/bigscience-workshop/biomedical | BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing | 126 | --- | --- |
| BigScience Evaluation | --- | https://github.com/bigscience-workshop/evaluation | --- | 28 | --- | --- |
| Language Model Evaluation Harness | --- | https://github.com/EleutherAI/lm-evaluation-harness | Evaluating Large Language Models (LLMs) with Eleuther AI | 56 | --- | --- |
| Scholar Evals | --- | https://github.com/scholar-org/scholar-evals | --- | --- | --- | --- |
| Code Generation LM Evaluation Harness | --- | https://github.com/bigcode-project/bigcode-evaluation-harness | --- | 13 | --- | --- |
| Chatbot Arena | --- | https://github.com/lm-sys/FastChat | --- | --- | --- | --- |
| GLUE | --- | https://github.com/nyu-mll/jiant | --- | 11 | --- | --- |
| SuperGLUE | --- | https://github.com/nyu-mll/jiant | --- | 10 | --- | --- |
| CLUE | --- | https://github.com/CLUEbenchmark/CLUE | --- | 9 | --- | --- |
| CodeXGLUE | --- | https://github.com/microsoft/CodeXGLUE | --- | 10 | --- | --- |
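
Most of the frameworks above expose a command-line or programmatic entry point. As one illustration, below is a minimal sketch of scoring a Hugging Face model on a single task with EleutherAI's Language Model Evaluation Harness. It assumes a recent release (roughly v0.4+) installed via `pip install lm-eval`; the model id, task name, and batch size are placeholders, and the exact API can change between versions, so check the repository's README for the current invocation.

```python
# Minimal sketch: evaluating a Hugging Face model with lm-evaluation-harness.
# Assumes `pip install lm-eval` (v0.4+); API details may differ across releases.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face causal LM backend
    model_args="pretrained=gpt2",  # placeholder model id; any HF model works
    tasks=["hellaswag"],           # placeholder task from the harness's task library
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (e.g. accuracy) are reported under the "results" key.
print(results["results"]["hellaswag"])
```

The equivalent command-line run is typically a single `lm_eval` invocation with the same model and task arguments; see the repository linked in the table for details.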