Pinned Repositories
BotChat
Evaluating LLMs' multi-round chat capability by assessing conversations generated by two LLM instances.
CompassJudger
DevEval
A Comprehensive Benchmark for Software Development.
GAOKAO-Eval
LawBench
Benchmarking Legal Knowledge of Large Language Models
MixtralKit
A toolkit for inference and evaluation of 'mixtral-8x7b-32kseqlen' from Mistral AI
MMBench
Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets. A minimal, illustrative usage sketch follows this list.
T-Eval
[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
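OpenCompass evaluations are typically driven by a Python config that pairs model definitions with dataset definitions and is then launched by the project's runner. The sketch below is a minimal illustration only, not the project's verbatim example: the import paths under .datasets and .models are hypothetical placeholders, and the actual config names shipped with the repository may differ.

    # eval_demo.py -- hypothetical OpenCompass config; module paths are placeholders
    from mmengine.config import read_base

    with read_base():
        # Pull in prebuilt dataset and model configs bundled with OpenCompass.
        from .datasets.demo.demo_gsm8k_chat_gen import gsm8k_datasets
        from .models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b

    datasets = gsm8k_datasets       # which benchmarks to run
    models = internlm2_chat_7b      # which models to evaluate

    # The config is then launched with the runner, e.g.:
    #   python run.py eval_demo.py -w outputs/demo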
OpenCompass's Repositories
open-compass/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks. A brief API sketch appears at the end of this list.
open-compass/MixtralKit
A toolkit for inference and evaluation of 'mixtral-8x7b-32kseqlen' from Mistral AI
open-compass/LawBench
Benchmarking Legal Knowledge of Large Language Models
open-compass/T-Eval
[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
open-compass/MMBench
Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
open-compass/BotChat
Evaluating LLMs' multi-round chat capability by assessing conversations generated by two LLM instances.
open-compass/GAOKAO-Eval
open-compass/DevEval
A Comprehensive Benchmark for Software Development.
open-compass/MathBench
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
open-compass/CompassJudger
open-compass/GTA
[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
open-compass/OpenFinData
open-compass/Ada-LEval
The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
open-compass/ANAH
[ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO
open-compass/CriticEval
[NeurIPS 2024] A comprehensive benchmark for evaluating the critique ability of LLMs
open-compass/GPassK
Official repository of "Are Your LLMs Capable of Stable Reasoning?"
open-compass/ProSA
[EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
open-compass/code-evaluator
A multi-language code evaluation tool.
open-compass/Creation-MMBench
Assessing Context-Aware Creative Intelligence in MLLMs
open-compass/CIBench
Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"
open-compass/CompassBench
Demo data of CompassBench
open-compass/human-eval
Code for the paper "Evaluating Large Language Models Trained on Code"
open-compass/CodeBench
open-compass/lagent-cibench
open-compass/evalplus
EvalPlus for rigorous evaluation of LLM-synthesized code
open-compass/.github
open-compass/hinode
A clean documentation and blog theme for your Hugo site based on Bootstrap 5
open-compass/oc_doc_website
open-compass/storage
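VLMEvalKit can be driven from the command line, and its README has also shown a small Python API for querying a supported model directly. The snippet below is a hedged sketch of that Python path: the supported_VLM registry, the model key 'idefics_9b_instruct', and the generate() call follow the pattern once shown in the project's documentation, but exact names and signatures may have changed, and the image path is a placeholder.

    # Hedged sketch of VLMEvalKit's Python API; model key and image path are placeholders.
    from vlmeval.config import supported_VLM

    # Instantiate one of the registered multimodal models.
    model = supported_VLM['idefics_9b_instruct']()

    # Ask a question about a local image: generate() takes an interleaved
    # list of image paths and text prompts.
    response = model.generate(['assets/apple.jpg', 'What is in this image?'])
    print(response)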