Pinned Repositories
BotChat
Evaluating LLMs' multi-round chat capability by assessing conversations generated by two LLM instances.
CompassJudger
DevEval
A Comprehensive Benchmark for Software Development.
GAOKAO-Eval
LawBench
Benchmarking Legal Knowledge of Large Language Models
MixtralKit
A toolkit for inference and evaluation of 'mixtral-8x7b-32kseqlen' from Mistral AI
MMBench
Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets. A minimal, illustrative usage sketch follows this list.
T-Eval
[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
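OpenCompass evaluations are typically driven by a Python config that pairs model definitions with dataset definitions and is then launched by the project's runner. The sketch below is a minimal illustration only, not the project's verbatim example: the import paths under .datasets and .models are hypothetical placeholders, and the actual config names shipped with the repository may differ.

    # eval_demo.py -- hypothetical OpenCompass config; module paths are placeholders
    from mmengine.config import read_base

    with read_base():
        # Pull in prebuilt dataset and model configs bundled with OpenCompass.
        from .datasets.demo.demo_gsm8k_chat_gen import gsm8k_datasets
        from .models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b

    datasets = gsm8k_datasets       # which benchmarks to run
    models = internlm2_chat_7b      # which models to evaluate

    # The config is then launched with the runner, e.g.:
    #   python run.py eval_demo.py -w outputs/demo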
OpenCompass's Repositories
open-compass/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks. A brief API sketch appears at the end of this list.
open-compass/MixtralKit
A toolkit for inference and evaluation of 'mixtral-8x7b-32kseqlen' from Mistral AI
open-compass/LawBench
Benchmarking Legal Knowledge of Large Language Models
open-compass/T-Eval
[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
open-compass/MMBench
Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
open-compass/BotChat
Evaluating LLMs' multi-round chat capability by assessing conversations generated by two LLM instances.
open-compass/GAOKAO-Eval
open-compass/DevEval
A Comprehensive Benchmark for Software Development.
open-compass/MathBench
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
open-compass/CompassJudger
open-compass/GTA
[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
open-compass/OpenFinData
open-compass/Ada-LEval
The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
open-compass/ANAH
[ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO
open-compass/CriticEval
[NeurIPS 2024] A comprehensive benchmark for evaluating the critique ability of LLMs
open-compass/GPassK
Official repository of "Are Your LLMs Capable of Stable Reasoning?"
open-compass/ProSA
[EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
open-compass/code-evaluator
A multi-language code evaluation tool.
open-compass/Creation-MMBench
Assessing Context-Aware Creative Intelligence in MLLMs
open-compass/CIBench
Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"
open-compass/CompassBench
Demo data of CompassBench
open-compass/human-eval
Code for the paper "Evaluating Large Language Models Trained on Code"
open-compass/CodeBench
open-compass/lagent-cibench
open-compass/evalplus
EvalPlus for rigorous evaluation of LLM-synthesized code
open-compass/.github
open-compass/hinode
A clean documentation and blog theme for your Hugo site based on Bootstrap 5
open-compass/oc_doc_website
open-compass/storage
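VLMEvalKit can be driven from the command line, and its README has also shown a small Python API for querying a supported model directly. The snippet below is a hedged sketch of that Python path: the supported_VLM registry, the model key 'idefics_9b_instruct', and the generate() call follow the pattern once shown in the project's documentation, but exact names and signatures may have changed, and the image path is a placeholder.

    # Hedged sketch of VLMEvalKit's Python API; model key and image path are placeholders.
    from vlmeval.config import supported_VLM

    # Instantiate one of the registered multimodal models.
    model = supported_VLM['idefics_9b_instruct']()

    # Ask a question about a local image: generate() takes an interleaved
    # list of image paths and text prompts.
    response = model.generate(['assets/apple.jpg', 'What is in this image?'])
    print(response)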