Awesome-LLM-Eval: a curated list of tools, benchmarks, demos, papers, and docs for the evaluation of Large Language Models (such as ChatGPT, LLaMA, GLM, etc.).
Name | Organization | URL | Date |
---|---|---|---|
Evals | OpenAI | https://github.com/openai/evals | |
lm-evaluation-harness | EleutherAI | https://github.com/EleutherAI/lm-evaluation-harness | |
Large language model evaluation and workflow framework from Phase AI | wgryc | https://github.com/wgryc/phasellm | |
Evaluation benchmark for large language models | FreedomIntelligence | https://github.com/FreedomIntelligence/LLMZoo | |
Holistic Evaluation of Language Models (HELM) | Stanford | https://github.com/stanford-crfm/helm | |
A lightweight evaluation tool for question-answering | Langchain | https://github.com/rlancemartin/auto-evaluator | |
PandaLM: Reproducible and Automated Language Model Assessment | WeOpenML | https://github.com/WeOpenML/PandaLM | |
FlagEval | BAAI / Tsinghua University | https://github.com/FlagOpen/FlagEval | |
AlpacaEval | tatsu-lab | https://github.com/tatsu-lab/alpaca_eval | |
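Most of the frameworks listed above follow the same basic pattern: iterate over a task's examples, query the model, and score the outputs with a task-specific metric. The sketch below is a minimal, framework-agnostic illustration of that loop, not the API of any of the listed projects; `generate` is a hypothetical stand-in for whatever model call a framework wraps.

```python
# Minimal sketch of the evaluate-loop shared by most harnesses above.
# `generate` is a hypothetical stand-in for the model call a framework would wrap.
from typing import Callable, Dict, List


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def run_eval(examples: List[Dict[str, str]],
             generate: Callable[[str], str],
             metric: Callable[[str, str], float] = exact_match) -> float:
    """Query the model on every example and return the average metric."""
    scores = [metric(generate(ex["prompt"]), ex["answer"]) for ex in examples]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy task: two arithmetic questions answered by a dummy "model" with one wrong answer.
    task = [{"prompt": "1+1=", "answer": "2"}, {"prompt": "2+3=", "answer": "5"}]
    canned = {"1+1=": "2", "2+3=": "6"}
    print(run_eval(task, generate=lambda p: canned[p]))  # -> 0.5
```

Real harnesses differ mainly in how they batch model calls, which metrics they register per task, and how they aggregate across tasks.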
Dataset | Organization | URL | Description |
---|---|---|---|
M3Exam | DAMO | M3Exam | A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models. |
KoLA | THU-KEG | KoLA | The Knowledge-oriented LLM Assessment benchmark (KoLA) is hosted by the Knowledge Engineering Group of Tsinghua University (THU-KEG) and aims to carefully benchmark the world knowledge of LLMs through meticulous design of the data, ability taxonomy, and evaluation metrics. |
promptbench | microsoft | promptbench | PromptBench is a powerful tool designed to scrutinize and analyze the interaction of large language models with various prompts. It provides a convenient infrastructure to simulate black-box adversarial prompt attacks on the models and evaluate their performances. This repository hosts the necessary codebase, datasets, and instructions to facilitate these experiments. |
OpenCompass | Shanghai AI Lab | OpenCompass | OpenCompass is an LLM evaluation platform, supporting evaluation of 20+ models over 50+ datasets, that enables fast, comprehensive benchmarking of large models using efficient distributed evaluation techniques. |
JioNLP-LLM评测数据集 | jionlp | JioNLP-LLM评测数据集 | An evaluation dataset for gauging the effectiveness of general-purpose LLMs, focused on how much a model helps human users and whether it reaches the level of an "intelligent assistant". Multiple-choice questions (about 32% of the set) come from various professional exams in mainland China and probe coverage of objective knowledge; subjective questions come from everyday writing tasks and probe the commonly used functions of an LLM |
BIG-bench | Google | BIG-bench | BIG-bench consists of 204 tasks covering topics in linguistics, child development, mathematics, commonsense reasoning, biology, physics, social bias, software development, and more |
BIG-Bench-Hard | Stanford NLP | BIG-Bench-Hard | A suite of 23 challenging BIG-Bench tasks, called BIG-Bench Hard (BBH); these are the tasks on which prior language-model evaluations did not outperform the average human rater |
SuperCLUE | CLUEbenchmark | SuperCLUE | A Chinese leaderboard whose test sets are built from three angles: basic abilities, professional abilities, and Chinese-specific abilities. Basic abilities cover 10 items such as semantic understanding, dialogue, logical reasoning, role play, code, and generation/creative writing; professional abilities cover 50+ items drawn from middle-school, university, and professional exams, ranging from mathematics and physics to geography and the social sciences; Chinese-specific abilities cover 10 items targeting tasks with distinctly Chinese characteristics, such as Chinese idioms, poetry, literature, and character forms |
Safety Eval | Tsinghua University | Safety Eval 安全大模型评测 | An evaluation set collected by Tsinghua University covering eight major categories such as hate speech, bias and discrimination, crime and illegal activity, privacy, and ethics, subdivided into 40+ fine-grained second-level safety categories and backed by a systematic safety evaluation framework |
GAOKAO-Bench | OpenLMLab | GAOKAO-Bench | GAOKAO-Bench is an evaluation framework that uses Chinese Gaokao (college entrance exam) questions as its dataset to assess a large model's language understanding and logical reasoning abilities |
Gaokao | ExpressAI | Gaokao | The Gaokao benchmark aims to evaluate and track progress toward human-level intelligence. Beyond a comprehensive evaluation over tasks and domains that are practically useful in real-world scenarios, it provides rich human performance data so that large models can be compared directly with humans |
MMLU | paperswithcode.com | MMLU | Covers 57 subjects across STEM, the humanities, the social sciences, and more, with difficulty ranging from elementary to advanced professional level; it tests both world knowledge and problem-solving ability. Subjects span traditional areas such as mathematics and history as well as more specialized ones such as law and ethics, and the granularity and breadth of topics make the benchmark well suited to uncovering a model's blind spots |
CMMLU | MBZUAI & ShangHai JiaoTong & Microsoft | CMMLU | Measuring massive multitask language understanding in Chinese |
MMCU | 甲骨易AI研究院 | MMCU | A test for measuring the multitask accuracy of Chinese large models, covering four domains: medicine, law, psychology, and education, with more than 10,000 questions in total (2,819 medical, 3,695 legal, 2,001 psychology, and 3,331 education) |
AGIEval | Microsoft Research | AGIEval | Launched by Microsoft Research to comprehensively assess foundation models on tasks related to human cognition and problem solving; it consists of 20 public, rigorous official admission and professional qualification exams, including the Chinese Gaokao and judicial exams as well as the American SAT, LSAT, GRE, and GMAT |
C_Eval | SJTU, Tsinghua University & University of Edinburgh | C_Eval | An evaluation suite jointly produced by Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh, covering 52 subjects to assess the advanced knowledge and reasoning abilities of large models; it has been used to evaluate GPT-4, ChatGPT, Claude, LLaMA, Moss, and other models |
XieZhi | Fudan University | XieZhi | A comprehensive domain-knowledge evaluation suite for language models, consisting of 249,587 multiple-choice questions spanning 516 diverse disciplines and four difficulty levels. The authors also propose Xiezhi-Specialty and Xiezhi-Interdiscipline, each with 15k questions, and use the benchmark to evaluate 47 cutting-edge LLMs |
MT-bench | UC Berkeley, UCSD, CMU, Stanford, MBZUAI | MT-bench | A benchmark consisting of 80 high-quality multi-turn questions. MT-bench is designed to test multi-turn conversation and instruction-following ability, covering common use cases and focusing on challenging questions to differentiate models. It includes 8 common categories of user prompts to guide its construction: writing, roleplay, extraction, reasoning, math, coding, etc. |
GLUE Benchmark | NYU, University of Washington, DeepMind, Facebook AI Research, Allen Institute for AI, Google AI Language | GLUE Benchmark | Evaluates models on tasks such as grammar, paraphrasing, text similarity, inference, textual entailment, and pronoun coreference |
OpenAI Moderation API | OpenAI | OpenAI Moderation API | Filters harmful or unsafe content |
GSM8K | OpenAI | GSM8K | GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. GSM8K segmented these into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ - / *) to reach the final answer. |
EleutherAI LM Eval | EleutherAI | EleutherAI LM Eval | Evaluates few-shot performance and fine-tuning effectiveness across a wide range of tasks |
OpenAI Evals | OpenAI | OpenAI Evals | Evaluates generated text for accuracy, diversity, consistency, robustness, transferability, efficiency, and fairness |
AlpacaEval | tatsu-lab | AlpacaEval | An LLM-based automatic evaluation that is fast, cheap, and reliable. It is based on the AlpacaFarm evaluation set, which tests the ability of models to follow general user instructions. Model responses are compared against reference Davinci003 responses by the provided GPT-4, Claude, or ChatGPT auto-annotators, which yields the reported win rates |
Adversarial NLI (ANLI) | Facebook AI Research, New York University, Johns Hopkins University, University of Maryland, Allen Institute for AI | Adversarial NLI (ANLI) | Evaluates robustness to adversarial examples, generalization, the ability to explain inferences, consistency, and resource efficiency (memory usage, inference time, and training time) |
LIT (Language Interpretability Tool) | Google | LIT | Provides a platform for evaluation against user-defined metrics and for analyzing a model's strengths, weaknesses, and potential biases |
ParlAI | Facebook AI Research | ParlAI | Evaluates models on accuracy, F1 score, perplexity (how well a model predicts the next word in a sequence), human judgments (relevance, fluency, and coherence), speed and resource usage, robustness (performance under noisy inputs, adversarial attacks, or varying data quality), and generalization |
CoQA | Stanford NLP Group | CoQA | Evaluates a model's ability to understand a text passage and answer a series of interconnected questions that arise in a conversation |
LAMBADA | University of Trento and Fondazione Bruno Kessler | LAMBADA | Evaluates long-range understanding by asking a model to predict the last word of a passage |
HellaSwag | University of Washington and Allen Institute for AI | HellaSwag | Evaluates a model's commonsense inference ability |
LogiQA | Tsinghua University and Microsoft Research Asia | LogiQA | Evaluates a model's logical reasoning ability |
MultiNLI | New York University, DeepMind, Facebook AI Research, Allen Institute for AI, Google AI Language | MultiNLI | Evaluates a model's ability to understand relationships between sentences across different genres of text |
SQuAD | Stanford NLP Group | SQuAD | Evaluates performance on reading-comprehension tasks |
Open LLM Leaderboard | HuggingFace | Leaderboard | An LLM leaderboard organized by HuggingFace that has evaluated most mainstream open-source LLMs. The evaluation covers four datasets, AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA, and is primarily in English |
chinese-llm-benchmark | jeinlee1991 | llm-benchmark | A Chinese LLM capability leaderboard covering Baidu Wenxin Yiyan (ERNIE Bot), ChatGPT, Alibaba Tongyi Qianwen, iFLYTEK Spark, BELLE, ChatGLM-6B, and other open models, with multi-dimensional capability evaluation. It provides both capability-score rankings and the raw outputs of every model |
AlpacaEval | tatsu-lab | AlpacaEval | LLM-based automatic evaluation of leading open models such as Vicuna, OpenChat, and WizardLM |
HuggingFace Open LLM Leaderboard | huggingface | HF开源LLM排行榜 | Evaluates open-source models only, ranking them on four Eleuther AI evaluation sets; Falcon and Vicuna have both topped the leaderboard |
lmsys-arena | Berkeley | lmsys排名榜 | Uses an Elo rating mechanism; the ranking is GPT-4 > Claude > GPT-3.5 > Vicuna > others |
CMU open-source chatbot evaluation | CMU | zeno-build | Suggests that training in conversational settings may be important; the ranking is ChatGPT > Vicuna > others |
Z-Bench (ZhenFund Chinese evaluation) | ZhenFund (真格基金) | Z-Bench | Finds that the coding usability of domestic Chinese models is still relatively low and that the gap between models is small; the two versions of ChatGLM show clear improvement |
Chain-of-thought evaluation | Yao Fu | COT评估 | Rankings on complex problems such as GSM8K and MATH |
InfoQ LLM comprehensive capability evaluation | InfoQ | InfoQ评测 | Chinese-oriented; the ranking is ChatGPT > Wenxin Yiyan > Claude > Spark |
ToolBench tool-use evaluation | BAAI / Tsinghua | ToolBench | Provides evaluation scripts that compare tool-finetuned models against ChatGPT |
AgentBench reasoning & decision-making leaderboard | THUDM | AgentBench | Launched by Tsinghua together with several universities; covers models' reasoning and decision-making abilities in task environments such as shopping, household tasks, and operating systems |
FlagEval | BAAI / Tsinghua | FlagEval | Produced by BAAI; combines subjective and objective scoring and provides an LLM score leaderboard |
ChatEval | THU-NLP | ChatEval | ChatEval aims to simplify the process of human evaluation of generated text. Given different pieces of text, the agents in ChatEval (played by LLMs) autonomously debate their nuances and differences according to their assigned personas and then deliver their judgments |
Zhujiu | Institute of Automation, CAS | Zhujiu | Multi-dimensional ability coverage spanning 7 ability dimensions and 51 tasks; multi-faceted evaluation that combines 3 different but complementary evaluation methods; a comprehensive Chinese benchmark that also provides English evaluation |
LucyEval | 甲骨文 | LucyEval | A Chinese LLM maturity evaluation. Through objective tests of each aspect of a model's abilities it locates the model's weaknesses, helping designers and engineers adjust and train models more precisely on the way to more capable large models |
- Chat Arena: anonymous models side-by-side, with votes for which one is better. An open "anonymous" arena for large AI models: you act as a judge, scoring the answers of two models whose names are hidden, and their real identities are revealed after you vote. Contestants so far include Vicuna, Koala, OpenAssistant (oasst), Dolly, ChatGLM, StableLM, Alpaca, LLaMA, and others (a minimal pairwise-judging sketch follows below).
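Several entries above (AlpacaEval, MT-bench, PandaLM, ChatEval, Chat Arena) rest on pairwise comparisons: a judge, whether a stronger LLM or a human voter, is shown two candidate answers to the same question and picks the better one, and per-model win rates are aggregated from those votes. The sketch below shows only that aggregation step under simplified assumptions; `ask_judge` is a hypothetical callable standing in for a GPT-4/Claude call or a human vote, not any project's actual prompt or API.

```python
# Sketch of pairwise judge-based win-rate aggregation (AlpacaEval / MT-bench style).
# `ask_judge` is a hypothetical callable: given a prompt containing the question and
# two answers, it must return "A" or "B"; in practice this would be a GPT-4/Claude call.
from collections import Counter
from typing import Callable, List, Tuple

JUDGE_TEMPLATE = (
    "Question: {q}\n\nAnswer A: {a}\n\nAnswer B: {b}\n\n"
    "Which answer is better? Reply with exactly 'A' or 'B'."
)


def win_rate(pairs: List[Tuple[str, str, str]],
             ask_judge: Callable[[str], str]) -> float:
    """Fraction of pairs where the judge prefers answer A (the candidate model)."""
    votes = Counter(
        ask_judge(JUDGE_TEMPLATE.format(q=q, a=a, b=b)).strip().upper()
        for q, a, b in pairs
    )
    return votes["A"] / max(1, sum(votes.values()))


# Usage: pairs = [(question, candidate_answer, reference_answer), ...]
# print(win_rate(pairs, ask_judge=my_gpt4_call))
```

Real systems additionally randomize the A/B order to control for position bias and may allow a "tie" verdict; those details are omitted here.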
Models | MMLU (0-shot) | MMLU (1-shot) | MMLU (3-shot) | CEval (0-shot) | CEval (1-shot) | CEval (3-shot) | M3KE (0-shot) | Xiezhi-Spec.-Chinese (0-shot) | Xiezhi-Spec.-Chinese (1-shot) | Xiezhi-Spec.-Chinese (3-shot) | Xiezhi-Inter.-Chinese (0-shot) | Xiezhi-Inter.-Chinese (1-shot) | Xiezhi-Inter.-Chinese (3-shot) | Xiezhi-Spec.-English (0-shot) | Xiezhi-Spec.-English (1-shot) | Xiezhi-Spec.-English (3-shot) | Xiezhi-Inter.-English (0-shot) | Xiezhi-Inter.-English (1-shot) | Xiezhi-Inter.-English (3-shot) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Random-Guess | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 |
Generation Probability For Ranking (see the scoring sketch after this table) |||||||||||||||||||
Bloomz-560m | 0.111 | 0.109 | 0.119 | 0.124 | 0.117 | 0.103 | 0.126 | 0.123 | 0.127 | 0.124 | 0.130 | 0.138 | 0.140 | 0.113 | 0.116 | 0.123 | 0.124 | 0.117 | 0.160 |
Bloomz-1b1 | 0.131 | 0.116 | 0.128 | 0.107 | 0.115 | 0.110 | 0.082 | 0.138 | 0.108 | 0.107 | 0.117 | 0.125 | 0.123 | 0.130 | 0.119 | 0.114 | 0.144 | 0.129 | 0.145 |
Bloomz-1b7 | 0.107 | 0.117 | 0.164 | 0.054 | 0.058 | 0.103 | 0.102 | 0.165 | 0.151 | 0.159 | 0.152 | 0.214 | 0.170 | 0.133 | 0.140 | 0.144 | 0.150 | 0.149 | 0.209 |
Bloomz-3b | 0.139 | 0.084 | 0.146 | 0.168 | 0.182 | 0.194 | 0.063 | 0.186 | 0.154 | 0.168 | 0.151 | 0.180 | 0.182 | 0.201 | 0.155 | 0.156 | 0.175 | 0.164 | 0.158 |
Bloomz-7b1 | 0.167 | 0.160 | 0.205 | 0.074 | 0.072 | 0.073 | 0.073 | 0.154 | 0.178 | 0.162 | 0.148 | 0.160 | 0.156 | 0.176 | 0.153 | 0.207 | 0.217 | 0.204 | 0.229 |
Bloomz-7b1-mt | 0.189 | 0.196 | 0.210 | 0.077 | 0.078 | 0.158 | 0.072 | 0.163 | 0.175 | 0.154 | 0.155 | 0.195 | 0.164 | 0.180 | 0.146 | 0.219 | 0.228 | 0.171 | 0.232 |
Bloomz-7b1-p3 | 0.066 | 0.059 | 0.075 | 0.071 | 0.070 | 0.072 | 0.081 | 0.177 | 0.198 | 0.158 | 0.183 | 0.173 | 0.170 | 0.130 | 0.130 | 0.162 | 0.157 | 0.132 | 0.134 |
Bloomz | 0.051 | 0.066 | 0.053 | 0.142 | 0.166 | 0.240 | 0.098 | 0.185 | 0.133 | 0.277 | 0.161 | 0.099 | 0.224 | 0.069 | 0.082 | 0.056 | 0.058 | 0.055 | 0.049 |
Bloomz-mt | 0.266 | 0.264 | 0.248 | 0.204 | 0.164 | 0.151 | 0.161 | 0.253 | 0.198 | 0.212 | 0.213 | 0.189 | 0.184 | 0.379 | 0.396 | 0.394 | 0.383 | 0.405 | 0.398 |
Bloomz-p3 | 0.115 | 0.093 | 0.057 | 0.118 | 0.137 | 0.140 | 0.115 | 0.136 | 0.095 | 0.105 | 0.086 | 0.065 | 0.098 | 0.139 | 0.097 | 0.069 | 0.176 | 0.141 | 0.070 |
llama-7b | 0.125 | 0.132 | 0.093 | 0.133 | 0.106 | 0.110 | 0.158 | 0.152 | 0.141 | 0.117 | 0.142 | 0.135 | 0.128 | 0.159 | 0.165 | 0.161 | 0.194 | 0.183 | 0.176 |
llama-13b | 0.166 | 0.079 | 0.135 | 0.152 | 0.181 | 0.169 | 0.131 | 0.133 | 0.241 | 0.243 | 0.211 | 0.202 | 0.303 | 0.154 | 0.183 | 0.215 | 0.174 | 0.216 | 0.231 |
llama-30b | 0.076 | 0.107 | 0.073 | 0.079 | 0.119 | 0.082 | 0.079 | 0.140 | 0.206 | 0.162 | 0.186 | 0.202 | 0.183 | 0.110 | 0.195 | 0.161 | 0.088 | 0.158 | 0.219 |
llama-65b | 0.143 | 0.121 | 0.100 | 0.154 | 0.141 | 0.168 | 0.125 | 0.142 | 0.129 | 0.084 | 0.108 | 0.077 | 0.077 | 0.183 | 0.204 | 0.172 | 0.133 | 0.191 | 0.157 |
baize-7b (LoRA) | 0.129 | 0.091 | 0.079 | 0.194 | 0.180 | 0.206 | 0.231 | 0.216 | 0.148 | 0.123 | 0.173 | 0.158 | 0.198 | 0.182 | 0.190 | 0.194 | 0.218 | 0.188 | 0.209 |
baize-7b-healthcare (LoRA) | 0.130 | 0.121 | 0.106 | 0.178 | 0.174 | 0.178 | 0.203 | 0.178 | 0.146 | 0.123 | 0.266 | 0.107 | 0.118 | 0.175 | 0.164 | 0.173 | 0.197 | 0.231 | 0.198 |
baize-13b (LoRA) | 0.131 | 0.111 | 0.171 | 0.184 | 0.178 | 0.195 | 0.155 | 0.158 | **0.221** | 0.256 | 0.208 | 0.200 | 0.219 | 0.176 | 0.189 | 0.239 | 0.187 | 0.185 | 0.274 |
baize-30b (LoRA) | 0.193 | 0.216 | 0.207 | 0.191 | 0.196 | 0.121 | 0.071 | 0.109 | 0.212 | 0.190 | 0.203 | 0.256 | 0.200 | 0.167 | 0.235 | 0.168 | 0.072 | 0.180 | 0.193 |
Belle-0.2M | 0.127 | 0.148 | 0.243 | 0.053 | 0.063 | 0.136 | 0.076 | 0.172 | 0.126 | 0.153 | 0.171 | 0.165 | 0.147 | 0.206 | 0.146 | 0.148 | 0.217 | 0.150 | 0.173 |
Belle-0.6M | 0.091 | 0.114 | 0.180 | 0.082 | 0.080 | 0.090 | 0.075 | 0.188 | 0.149 | 0.198 | 0.188 | 0.188 | 0.175 | 0.173 | 0.172 | 0.183 | 0.193 | 0.184 | 0.196 |
Belle-1M | 0.137 | 0.126 | 0.162 | 0.066 | 0.065 | 0.072 | 0.066 | 0.170 | 0.152 | 0.147 | 0.173 | 0.176 | 0.197 | 0.211 | 0.137 | 0.149 | 0.207 | 0.151 | 0.185 |
Belle-2M | 0.127 | 0.148 | 0.132 | 0.058 | 0.063 | 0.136 | 0.057 | 0.163 | 0.166 | 0.130 | 0.159 | 0.177 | 0.163 | 0.155 | 0.106 | 0.166 | 0.151 | 0.150 | 0.138 |
chatglm-6B | 0.099 | 0.109 | 0.112 | 0.084 | 0.074 | 0.114 | 0.115 | 0.082 | 0.097 | 0.147 | 0.104 | 0.111 | 0.144 | 0.106 | 0.120 | 0.124 | 0.099 | 0.079 | 0.097 |
doctorglm-6b | 0.093 | 0.076 | 0.065 | 0.037 | 0.085 | 0.051 | 0.038 | 0.062 | 0.068 | 0.044 | 0.047 | 0.056 | 0.043 | 0.069 | 0.053 | 0.043 | 0.106 | 0.059 | 0.059 |
moss-base-16B | 0.072 | 0.050 | 0.062 | 0.115 | 0.048 | 0.052 | 0.099 | 0.105 | 0.051 | 0.059 | 0.123 | 0.054 | 0.058 | 0.124 | 0.077 | 0.080 | 0.121 | 0.058 | 0.063 |
moss-sft-16B | 0.064 | 0.065 | 0.051 | 0.063 | 0.062 | 0.072 | 0.075 | 0.072 | 0.067 | 0.068 | 0.073 | 0.081 | 0.066 | 0.071 | 0.070 | 0.059 | 0.074 | 0.084 | 0.075 |
vicuna-7b | 0.051 | 0.051 | 0.029 | 0.063 | 0.071 | 0.064 | 0.059 | 0.169 | 0.171 | 0.165 | 0.134 | 0.201 | 0.213 | 0.182 | 0.209 | 0.195 | 0.200 | 0.214 | 0.182 |
vicuna-13b | 0.109 | 0.104 | 0.066 | 0.060 | 0.131 | 0.131 | 0.067 | 0.171 | 0.167 | 0.166 | 0.143 | 0.147 | 0.178 | 0.121 | 0.139 | 0.128 | 0.158 | 0.174 | 0.191 |
alpaca-7b | 0.135 | 0.170 | 0.202 | 0.137 | 0.119 | 0.113 | 0.142 | 0.129 | 0.139 | 0.123 | 0.178 | 0.104 | 0.097 | 0.189 | 0.179 | 0.128 | 0.200 | 0.185 | 0.149 |
pythia-1.4b | 0.124 | 0.127 | 0.121 | 0.108 | 0.132 | 0.138 | 0.083 | 0.125 | 0.128 | 0.135 | 0.111 | 0.146 | 0.135 | 0.158 | 0.124 | 0.124 | 0.166 | 0.126 | 0.118 |
pythia-2.8b | 0.103 | 0.110 | 0.066 | 0.064 | 0.089 | 0.122 | 0.086 | 0.114 | 0.120 | 0.131 | 0.091 | 0.113 | 0.112 | 0.126 | 0.118 | 0.112 | 0.110 | 0.145 | 0.107 |
pythia-6.9b | 0.115 | 0.070 | 0.084 | 0.078 | 0.073 | 0.094 | 0.073 | 0.086 | 0.094 | 0.092 | 0.097 | 0.098 | 0.085 | 0.091 | 0.088 | 0.083 | 0.099 | 0.099 | 0.096 |
pythia-12b | 0.075 | 0.059 | 0.066 | 0.077 | 0.097 | 0.078 | 0.098 | 0.102 | 0.126 | 0.132 | 0.125 | 0.147 | 0.159 | 0.079 | 0.098 | 0.110 | 0.094 | 0.120 | 0.120 |
gpt-neox-20b | 0.081 | 0.132 | 0.086 | 0.086 | 0.096 | 0.069 | 0.094 | 0.140 | 0.103 | 0.109 | 0.120 | 0.098 | 0.085 | 0.088 | 0.101 | 0.116 | 0.099 | 0.113 | 0.156 |
h2ogpt-12b | 0.075 | 0.087 | 0.078 | 0.080 | 0.078 | 0.094 | 0.070 | 0.065 | 0.047 | 0.073 | 0.076 | 0.061 | 0.091 | 0.088 | 0.050 | 0.065 | 0.105 | 0.063 | 0.067 |
h2ogpt-20b | 0.114 | 0.098 | 0.110 | 0.094 | 0.084 | 0.061 | 0.096 | 0.108 | 0.080 | 0.073 | 0.086 | 0.081 | 0.072 | 0.108 | 0.068 | 0.086 | 0.109 | 0.071 | 0.079 |
dolly-3b | 0.066 | 0.060 | 0.055 | 0.079 | 0.083 | 0.077 | 0.066 | 0.100 | 0.090 | 0.083 | 0.091 | 0.093 | 0.085 | 0.079 | 0.063 | 0.077 | 0.076 | 0.074 | 0.084 |
dolly-7b | 0.095 | 0.068 | 0.052 | 0.091 | 0.079 | 0.070 | 0.108 | 0.108 | 0.089 | 0.092 | 0.111 | 0.095 | 0.100 | 0.096 | 0.059 | 0.086 | 0.123 | 0.085 | 0.090 |
dolly-12b | 0.095 | 0.068 | 0.093 | 0.085 | 0.071 | 0.073 | 0.114 | 0.098 | 0.106 | 0.103 | 0.094 | 0.114 | 0.106 | 0.086 | 0.088 | 0.098 | 0.088 | 0.102 | 0.116 |
stablelm-3b | 0.070 | 0.085 | 0.071 | 0.086 | 0.082 | 0.099 | 0.096 | 0.101 | 0.087 | 0.091 | 0.083 | 0.092 | 0.067 | 0.069 | 0.089 | 0.081 | 0.066 | 0.085 | 0.088 |
stablelm-7b | 0.158 | 0.118 | 0.093 | 0.133 | 0.102 | 0.093 | 0.140 | 0.085 | 0.118 | 0.122 | 0.123 | 0.130 | 0.095 | 0.123 | 0.103 | 0.100 | 0.134 | 0.121 | 0.105 |
falcon-7b | 0.048 | 0.046 | 0.051 | 0.046 | 0.051 | 0.052 | 0.050 | 0.077 | 0.096 | 0.112 | 0.129 | 0.141 | 0.142 | 0.124 | 0.103 | 0.107 | 0.198 | 0.200 | 0.205 |
falcon-7b-instruct | 0.078 | 0.095 | 0.106 | 0.114 | 0.095 | 0.079 | 0.104 | 0.075 | 0.083 | 0.087 | 0.060 | 0.133 | 0.123 | 0.160 | 0.203 | 0.156 | 0.141 | 0.167 | 0.152 |
falcon-40b | 0.038 | 0.043 | 0.077 | 0.085 | 0.090 | 0.129 | 0.087 | 0.069 | 0.056 | 0.053 | 0.065 | 0.063 | 0.058 | 0.059 | 0.077 | 0.066 | 0.085 | 0.063 | 0.076 |
falcon-40b-instruct | 0.126 | 0.123 | 0.121 | 0.070 | 0.080 | 0.068 | 0.141 | 0.103 | 0.085 | 0.079 | 0.115 | 0.082 | 0.081 | 0.118 | 0.143 | 0.124 | 0.083 | 0.108 | 0.104 |
Instruction For Ranking | |||||||||||||||||||
ChatGPT | 0.240 | 0.298 | 0.371 | 0.286 | 0.289 | 0.360 | 0.290 | 0.218 | 0.352 | 0.414 | 0.266 | 0.418 | 0.487 | 0.217 | 0.361 | 0.428 | 0.305 | 0.452 | 0.517 |
GPT-4 | 0.402 | 0.415 | 0.517 | 0.413 | 0.410 | 0.486 | 0.404 | 0.392 | 0.429 | 0.490 | 0.453 | 0.496 | 0.565 | 0.396 | 0.434 | 0.495 | 0.463 | 0.506 | 0.576 |
Statistic | |||||||||||||||||||
Performance-Average | 0.120 | 0.117 | 0.125 | 0.113 | 0.114 | 0.124 | 0.111 | 0.140 | 0.140 | 0.145 | 0.144 | 0.148 | 0.152 | 0.145 | 0.145 | 0.150 | 0.156 | 0.157 | 0.166 |
Performance-Variance | 0.062 | 0.068 | 0.087 | 0.067 | 0.065 | 0.078 | 0.064 | 0.058 | 0.070 | 0.082 | 0.067 | 0.082 | 0.095 | 0.067 | 0.080 | 0.090 | 0.078 | 0.092 | 0.104 |
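The "Generation Probability For Ranking" block in the table above scores base models by computing the likelihood the model assigns to each answer option conditioned on the question and picking the highest-scoring option. The sketch below illustrates that idea with HuggingFace `transformers`; the model id (`gpt2`) and the prompt format are illustrative only and are not the exact setup used by Xiezhi.

```python
# Sketch: rank multiple-choice options by the log-probability a causal LM assigns
# to each option conditioned on the question, as in the "Generation Probability
# For Ranking" rows above. Model and prompt format are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


def option_logprob(question: str, option: str) -> float:
    """Sum of token log-probs of `option` given `question` as the prefix."""
    prefix_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                      # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)    # predicts tokens 1..seq_len-1
    n_prefix = prefix_ids.shape[1]                           # BPE boundary effects ignored (sketch)
    option_ids = full_ids[0, n_prefix:]
    rows = range(n_prefix - 1, full_ids.shape[1] - 1)
    return sum(log_probs[r, t].item() for r, t in zip(rows, option_ids))


def pick(question: str, options: list) -> str:
    """Return the option with the highest conditional log-probability."""
    return max(options, key=lambda o: option_logprob(question, o))


print(pick("The capital of France is", ["Paris", "Berlin", "Madrid"]))
```

The "Instruction For Ranking" block, by contrast, prompts instruction-tuned models (ChatGPT, GPT-4) to output their ranked choice directly rather than reading off likelihoods.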
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, by Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity, by Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji et al.
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver?, by Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga and Diyi Yang
- ChatGPT versus Traditional Question Answering for Knowledge Graphs: Current Status and Future Directions Towards Knowledge Graph Chatbots, by Reham Omar, Omij Mangukiya, Panos Kalnis and Essam Mansour
- Mathematical Capabilities of ChatGPT, by Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier and Julius Berner
- Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization, by Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen and Wei Cheng
- On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective, by Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong Wang, Linyi Yang, Haojun Huang et al.
- ChatGPT is not all you need. A State of the Art Review of large Generative AI models, by Roberto Gozalo-Brizuela and Eduardo C. Garrido-Merchán
- Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT, by Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du and Dacheng Tao
- Evaluation of ChatGPT as a Question Answering System for Answering Complex Questions, by Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, Yongrui Chen and Guilin Qi
- ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models, by Ning Bian, Xianpei Han, Le Sun, Hongyu Lin, Yaojie Lu and Ben He
- Holistic Evaluation of Language Models, by Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan et al.
- Evaluating the Text-to-SQL Capabilities of Large Language Models, by Nitarshan Rajkumar, Raymond Li and Dzmitry Bahdanau
- Are Visual-Linguistic Models Commonsense Knowledge Bases?, by Hsiu-Yu Yang and Carina Silberer
- Is GPT-3 a Psychopath? Evaluating Large Language Models from a Psychological Perspective, by Xingxuan Li, Yutong Li, Linlin Liu, Lidong Bing and Shafiq R. Joty
- GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models, by Da Yin, Hritik Bansal, Masoud Monajatipoor, Liunian Harold Li and Kai-Wei Chang
- RobustLR: A Diagnostic Benchmark for Evaluating Logical Robustness of Deductive Reasoners, by Soumya Sanyal, Zeyi Liao and Xiang Ren
- A Systematic Evaluation of Large Language Models of Code, by Frank F. Xu, Uri Alon, Graham Neubig and Vincent J. Hellendoorn
- Evaluating Large Language Models Trained on Code, by Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda et al.
- GLGE: A New General Language Generation Evaluation Benchmark, by Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu et al.
- Evaluating Pre-Trained Models for User Feedback Analysis in Software Engineering: A Study on Classification of App-Reviews, by Mohammad Abdul Hadi and Fatemeh H. Fard
- Do Language Models Perform Generalizable Commonsense Inference?, by Peifeng Wang, Filip Ilievski, Muhao Chen and Xiang Ren
- RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms, by Pei Zhou, Rahul Khanna, Seyeon Lee, Bill Yuchen Lin, Daniel Ho, Jay Pujara and Xiang Ren
- Evaluation of Text Generation: A Survey, by Asli Celikyilmaz, Elizabeth Clark and Jianfeng Gao
- Neural Language Generation: Formulation, Methods, and Evaluation, by Cristina Garbacea and Qiaozhu Mei
- BERTScore: Evaluating Text Generation with BERT, by Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger and Yoav Artzi (a small usage sketch follows this list)
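For reference-based generation metrics such as BERTScore (last item above), the released `bert-score` package exposes a one-call scoring API; a small usage sketch is below, assuming the package is installed (`pip install bert-score`) and the default English model is acceptable.

```python
# Sketch: scoring candidate generations against references with BERTScore.
# Assumes `pip install bert-score`; the underlying model is downloaded on first call.
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Returns precision, recall, and F1 tensors, one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```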
Model | Size | Architecture | Access | Date | Origin |
---|---|---|---|---|---|
Switch Transformer | 1.6T | Decoder (MoE) | - | 2021-01 | Paper |
GLaM | 1.2T | Decoder (MoE) | - | 2021-12 | Paper |
PaLM | 540B | Decoder | - | 2022-04 | Paper |
MT-NLG | 530B | Decoder | - | 2022-01 | Paper |
J1-Jumbo | 178B | Decoder | api | 2021-08 | Paper |
OPT | 175B | Decoder | api / ckpt | 2022-05 | Paper |
BLOOM | 176B | Decoder | api / ckpt | 2022-11 | Paper |
GPT 3.0 | 175B | Decoder | api | 2020-05 | Paper |
LaMDA | 137B | Decoder | - | 2022-01 | Paper |
GLM | 130B | Decoder | ckpt | 2022-10 | Paper |
YaLM | 100B | Decoder | ckpt | 2022-06 | Blog |
LLaMA | 65B | Decoder | ckpt | 2023-02 | Paper |
GPT-NeoX | 20B | Decoder | ckpt | 2022-04 | Paper |
UL2 | 20B | agnostic | ckpt | 2022-05 | Paper |
鹏程.盘古α | 13B | Decoder | ckpt | 2021-04 | Paper |
T5 | 11B | Encoder-Decoder | ckpt | 2019-10 | Paper |
CPM-Bee | 10B | Decoder | api | 2022-10 | Paper |
rwkv-4 | 7B | RWKV | ckpt | 2022-09 | Github |
GPT-J | 6B | Decoder | ckpt | 2021-06 | Github |
GPT-Neo | 2.7B | Decoder | ckpt | 2021-03 | Github |
GPT-Neo | 1.3B | Decoder | ckpt | 2021-03 | Github |
Model | Size | Architecture | Access | Date | Origin |
---|---|---|---|---|---|
Flan-PaLM | 540B | Decoder | - | 2022-10 | Paper |
BLOOMZ | 176B | Decoder | ckpt | 2022-11 | Paper |
InstructGPT | 175B | Decoder | api | 2022-03 | Paper |
Galactica | 120B | Decoder | ckpt | 2022-11 | Paper |
OpenChatKit | 20B | - | ckpt | 2023-03 | - |
Flan-UL2 | 20B | Decoder | ckpt | 2023-03 | Blog |
Gopher | - | - | - | - | - |
Chinchilla | - | - | - | - | - |
Flan-T5 | 11B | Encoder-Decoder | ckpt | 2022-10 | Paper |
T0 | 11B | Encoder-Decoder | ckpt | 2021-10 | Paper |
Alpaca | 7B | Decoder | demo | 2023-03 | Github |
Model | Size | Architecture | Access | Date | Origin |
---|---|---|---|---|---|
GPT-4 | - | - | - | 2023-03 | Blog |
ChatGPT | - | Decoder | demo / api | 2022-11 | Blog |
Sparrow | 70B | - | - | 2022-09 | Paper |
Claude | - | - | demo / api | 2023-03 | Blog |
- LLaMA - A foundational, 65-billion-parameter large language model. LLaMA.cpp Lit-LLaMA
- Alpaca - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca.cpp Alpaca-LoRA
- Flan-Alpaca - Instruction Tuning from Humans and Machines.
- Baize - Baize is an open-source chat model trained with LoRA. It uses 100k dialogs generated by letting ChatGPT chat with itself.
- Cabrita - A Portuguese instruction-finetuned LLaMA.
- Vicuna - An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.
- Llama-X - Open Academic Research on Improving LLaMA to SOTA LLM.
- Chinese-Vicuna - A Chinese Instruction-following LLaMA-based Model.
- GPTQ-for-LLaMA - 4 bits quantization of LLaMA using GPTQ.
- GPT4All - Demo, data, and code to train an open-source assistant-style large language model based on GPT-J and LLaMA.
- Koala - A Dialogue Model for Academic Research
- BELLE - Be Everyone's Large Language model Engine
- StackLLaMA - A hands-on guide to train LLaMA with RLHF.
- RedPajama - An Open Source Recipe to Reproduce LLaMA training dataset.
- Chimera - Latin Phoenix.
- BLOOM - BigScience Large Open-science Open-access Multilingual Language Model BLOOM-LoRA
- BLOOMZ&mT0 - a family of models capable of following human instructions in dozens of languages zero-shot.
- Phoenix
- T5 - Text-to-Text Transfer Transformer
- T0 - Multitask Prompted Training Enables Zero-Shot Task Generalization
- OPT - Open Pre-trained Transformer Language Models.
- UL2 - a unified framework for pretraining models that are universally effective across datasets and setups.
- GLM - GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks.
- ChatGLM-6B - ChatGLM-6B is an open-source bilingual (Chinese-English) conversational language model based on the General Language Model (GLM) architecture, with 6.2 billion parameters.
- ChatGLM2-6B - The second-generation version of the open-source bilingual chat model ChatGLM-6B; it keeps the smooth conversation flow and low deployment barrier of the first generation while introducing longer context, better performance, and more efficient inference.
- RWKV - Parallelizable RNN with Transformer-level LLM Performance.
- ChatRWKV - ChatRWKV is like ChatGPT but powered by the RWKV (100% RNN) language model.
- StableLM - Stability AI Language Models.
- YaLM - a GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.
- GPT-Neo - An implementation of model & data parallel GPT3-like models using the mesh-tensorflow library.
- GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile.
- Dolly - a cheap-to-build LLM that exhibits a surprising degree of the instruction-following capability seen in ChatGPT.
- Pythia - Interpreting Autoregressive Transformers Across Time and Scale
- Dolly 2.0 - the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
- OpenFlamingo - an open-source reproduction of DeepMind's Flamingo model.
- Cerebras-GPT - A Family of Open, Compute-efficient, Large Language Models.
- GALACTICA - The GALACTICA models are trained on a large-scale scientific corpus.
- GALPACA - GALACTICA 30B fine-tuned on the Alpaca dataset.
- Palmyra - Palmyra Base was primarily pre-trained with English text.
- Camel - a state-of-the-art instruction-following large language model designed to deliver exceptional performance and versatility.
- PanGu-α - PanGu-α is a 200B-parameter autoregressive pretrained Chinese language model developed by Huawei Noah's Ark Lab, the MindSpore team, and Peng Cheng Laboratory.
- MOSS - MOSS is an open-source conversational language model that supports both Chinese and English as well as a variety of plugins.
- Open-Assistant - a project meant to give everyone access to a great chat-based large language model.
- HuggingChat - Powered by Open Assistant's latest model – the best open source chat model right now and @huggingface Inference API.
- Baichuan - An open-source, commercially usable large language model developed by Baichuan Intelligent Technology as the successor to Baichuan-7B, containing 13 billion parameters. (20230715)
- Qwen - Qwen-7B is the 7B-parameter version of the large language model series Qwen (abbr. Tongyi Qianwen) proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a large volume of data, including web texts, books, code, etc. (20230803) A generic checkpoint-loading sketch for these open models follows below.
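Most of the open models above are released as HuggingFace checkpoints (the "ckpt" access type in the tables); the sketch below shows a generic way to load one and generate text with `transformers`. The model id is illustrative only; chat-style models such as ChatGLM additionally require `trust_remote_code=True` and their own chat helpers.

```python
# Generic sketch for trying one of the open checkpoints above with transformers.
# The model id is illustrative; substitute the HuggingFace repo you want to test.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-1b"  # illustrative choice from the tables in this README
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt = "Question: What does an LLM evaluation harness do?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```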
Model | Reference | Link | #Parameters | Base Model | #Layers | #Encoder Layers | #Decoder Layers | #Pretrain Tokens | #IFT Samples | RLHF |
---|---|---|---|---|---|---|---|---|---|---|
GPT3-Ada | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 0.35B | - | 24 | - | 24 | - | - | - |
Pythia-1B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-1b | 1B | - | 16 | - | 16 | 300B tokens | - | - |
GPT3-Babbage | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 1.3B | - | 24 | - | 24 | - | - | - |
GPT2-XL | radford2019language | https://huggingface.co/gpt2-xl | 1.5B | - | 48 | - | 48 | 40B tokens | - | - |
BLOOM-1b7 | scao2022bloom | https://huggingface.co/bigscience/bloom-1b7 | 1.7B | - | 24 | - | 24 | 350B tokens | - | - |
BLOOMZ-1b7 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-1b7 | 1.7B | BLOOM-1b7 | 24 | - | 24 | - | 8.39B tokens | - |
Dolly-v2-3b | 2023dolly | https://huggingface.co/databricks/dolly-v2-3b | 2.8B | Pythia-2.8B | 32 | - | 32 | - | 15K | - |
Pythia-2.8B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-2.8b | 2.8B | - | 32 | - | 32 | 300B tokens | - | - |
BLOOM-3b | scao2022bloom | https://huggingface.co/bigscience/bloom-3b | 3B | - | 30 | - | 30 | 350B tokens | - | - |
BLOOMZ-3b | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-3b | 3B | BLOOM-3b | 30 | - | 30 | - | 8.39B tokens | - |
StableLM-Base-Alpha-3B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-base-alpha-3b | 3B | - | 16 | - | 16 | 800B tokens | - | - |
StableLM-Tuned-Alpha-3B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b | 3B | StableLM-Base-Alpha-3B | 16 | - | 16 | - | 632K | - |
ChatGLM-6B | zeng2023glm-130b,du2022glm | https://huggingface.co/THUDM/chatglm-6b | 6B | - | 28 | 28 | 28 | 1T tokens | ✓ | ✓ |
DoctorGLM | xiong2023doctorglm | https://github.com/xionghonglin/DoctorGLM | 6B | ChatGLM-6B | 28 | 28 | 28 | - | 6.38M | - |
ChatGLM-Med | ChatGLM-Med | https://github.com/SCIR-HI/Med-ChatGLM | 6B | ChatGLM-6B | 28 | 28 | 28 | - | 8K | - |
GPT3-Curie | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 6.7B | - | 32 | - | 32 | - | - | - |
MPT-7B-Chat | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-chat | 6.7B | MPT-7B | 32 | - | 32 | - | 360K | - |
MPT-7B-Instruct | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-instruct | 6.7B | MPT-7B | 32 | - | 32 | - | 59.3K | - |
MPT-7B-StoryWriter-65k+ | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-storywriter | 6.7B | MPT-7B | 32 | - | 32 | - | ✓ | - |
Dolly-v2-7b | 2023dolly | https://huggingface.co/databricks/dolly-v2-7b | 6.9B | Pythia-6.9B | 32 | - | 32 | - | 15K | - |
h2ogpt-oig-oasst1-512-6.9b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oig-oasst1-512-6.9b | 6.9B | Pythia-6.9B | 32 | - | 32 | - | 398K | - |
Pythia-6.9B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-6.9b | 6.9B | - | 32 | - | 32 | 300B tokens | - | - |
Alpaca-7B | alpaca | https://huggingface.co/tatsu-lab/alpaca-7b-wdiff | 7B | LLaMA-7B | 32 | - | 32 | - | 52K | - |
Alpaca-LoRA-7B | 2023alpacalora | https://huggingface.co/tloen/alpaca-lora-7b | 7B | LLaMA-7B | 32 | - | 32 | - | 52K | - |
Baize-7B | xu2023baize | https://huggingface.co/project-baize/baize-lora-7B | 7B | LLaMA-7B | 32 | - | 32 | - | 263K | - |
Baize Healthcare-7B | xu2023baize | https://huggingface.co/project-baize/baize-healthcare-lora-7B | 7B | LLaMA-7B | 32 | - | 32 | - | 201K | - |
ChatDoctor | yunxiang2023chatdoctor | https://github.com/Kent0n-Li/ChatDoctor | 7B | LLaMA-7B | 32 | - | 32 | - | 167K | - |
HuaTuo | wang2023huatuo | https://github.com/scir-hi/huatuo-llama-med-chinese | 7B | LLaMA-7B | 32 | - | 32 | - | 8K | - |
Koala-7B | koala_blogpost_2023 | https://huggingface.co/young-geng/koala | 7B | LLaMA-7B | 32 | - | 32 | - | 472K | - |
LLaMA-7B | touvron2023llama | https://huggingface.co/decapoda-research/llama-7b-hf | 7B | - | 32 | - | 32 | 1T tokens | - | - |
Luotuo-lora-7b-0.3 | luotuo | https://huggingface.co/silk-road/luotuo-lora-7b-0.3 | 7B | LLaMA-7B | 32 | - | 32 | - | 152K | - |
StableLM-Base-Alpha-7B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-base-alpha-7b | 7B | - | 16 | - | 16 | 800B tokens | - | - |
StableLM-Tuned-Alpha-7B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b | 7B | StableLM-Base-Alpha-7B | 16 | - | 16 | - | 632K | - |
Vicuna-7b-delta-v1.1 | vicuna2023 | https://github.com/lm-sys/FastChat#vicuna-weights | 7B | LLaMA-7B | 32 | - | 32 | - | 70K | - |
BELLE-7B-0.2M /0.6M /1M /2M | belle2023exploring | https://huggingface.co/BelleGroup/BELLE-7B-2M | 7.1B | Bloomz-7b1-mt | 30 | - | 30 | - | 0.2M/0.6M/1M/2M | - |
BLOOM-7b1 | scao2022bloom | https://huggingface.co/bigscience/bloom-7b1 | 7.1B | - | 30 | - | 30 | 350B tokens | - | - |
BLOOMZ-7b1 /mt /p3 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-7b1-p3 | 7.1B | BLOOM-7b1 | 30 | - | 30 | - | 4.19B tokens | - |
Dolly-v2-12b | 2023dolly | https://huggingface.co/databricks/dolly-v2-12b | 12B | Pythia-12B | 36 | - | 36 | - | 15K | - |
h2ogpt-oasst1-512-12b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oasst1-512-12b | 12B | Pythia-12B | 36 | - | 36 | - | 94.6K | - |
Open-Assistant-SFT-4-12B | 2023openassistant | https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 | 12B | Pythia-12B-deduped | 36 | - | 36 | - | 161K | - |
Pythia-12B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-12b | 12B | - | 36 | - | 36 | 300B tokens | - | - |
Baize-13B | xu2023baize | https://huggingface.co/project-baize/baize-lora-13B | 13B | LLaMA-13B | 40 | - | 40 | - | 263K | - |
Koala-13B | koala_blogpost_2023 | https://huggingface.co/young-geng/koala | 13B | LLaMA-13B | 40 | - | 40 | - | 472K | - |
LLaMA-13B | touvron2023llama | https://huggingface.co/decapoda-research/llama-13b-hf | 13B | - | 40 | - | 40 | 1T tokens | - | - |
StableVicuna-13B | 2023StableLM | https://huggingface.co/CarperAI/stable-vicuna-13b-delta | 13B | Vicuna-13B v0 | 40 | - | 40 | - | 613K | ✓ |
Vicuna-13b-delta-v1.1 | vicuna2023 | https://github.com/lm-sys/FastChat#vicuna-weights | 13B | LLaMA-13B | 40 | - | 40 | - | 70K | - |
moss-moon-003-sft | 2023moss | https://huggingface.co/fnlp/moss-moon-003-sft | 16B | moss-moon-003-base | 34 | - | 34 | - | 1.1M | - |
moss-moon-003-sft-plugin | 2023moss | https://huggingface.co/fnlp/moss-moon-003-sft-plugin | 16B | moss-moon-003-base | 34 | - | 34 | - | 1.4M | - |
GPT-NeoX-20B | gptneox | https://huggingface.co/EleutherAI/gpt-neox-20b | 20B | - | 44 | - | 44 | 825GB | - | - |
h2ogpt-oasst1-512-20b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oasst1-512-20b | 20B | GPT-NeoX-20B | 44 | - | 44 | - | 94.6K | - |
Baize-30B | xu2023baize | https://huggingface.co/project-baize/baize-lora-30B | 33B | LLaMA-30B | 60 | - | 60 | - | 263K | - |
LLaMA-30B | touvron2023llama | https://huggingface.co/decapoda-research/llama-30b-hf | 33B | - | 60 | - | 60 | 1.4T tokens | - | - |
LLaMA-65B | touvron2023llama | https://huggingface.co/decapoda-research/llama-65b-hf | 65B | - | 80 | - | 80 | 1.4T tokens | - | - |
GPT3-Davinci | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 175B | - | 96 | - | 96 | 300B tokens | - | - |
BLOOM | scao2022bloom | https://huggingface.co/bigscience/bloom | 176B | - | 70 | - | 70 | 366B tokens | - | - |
BLOOMZ /mt /p3 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-p3 | 176B | BLOOM | 70 | - | 70 | - | 2.09B tokens | - |
ChatGPT (2023.05.01) | openaichatgpt | https://platform.openai.com/docs/models/gpt-3-5 | - | GPT-3.5 | - | - | - | - | ✓ | ✓ |
GPT-4 (2023.05.01) | openai2023gpt4 | https://platform.openai.com/docs/models/gpt-4 | - | - | - | - | - | - | ✓ | ✓ |
- Evaluating Language Models by OpenAI, DeepMind, Google, Microsoft.
- Awesome LLM - A curated list of papers about large language models.
- Awesome ChatGPT Prompts - A collection of prompt examples to be used with the ChatGPT model.
- awesome-chatgpt-prompts-zh - A Chinese collection of prompt examples to be used with the ChatGPT model.
- Awesome ChatGPT - Curated list of resources for ChatGPT and GPT-3 from OpenAI.
- Chain-of-Thoughts Papers - A trend starting from "Chain of Thought Prompting Elicits Reasoning in Large Language Models".
- Instruction-Tuning-Papers - A trend starting from `Natural-Instruction` (ACL 2022), `FLAN` (ICLR 2022) and `T0` (ICLR 2022).
- LLM Reading List - A paper & resource list of large language models.
- Reasoning using Language Models - Collection of papers and resources on Reasoning using Language Models.
- Chain-of-Thought Hub - Measuring LLMs' Reasoning Performance
- Awesome GPT - A curated list of awesome projects and resources related to GPT, ChatGPT, OpenAI, LLM, and more.
- Awesome GPT-3 - a collection of demos and articles about the OpenAI GPT-3 API.
This project is released under the MIT License.
This project also follows the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
If this project is helpful to you, please cite it:
@misc{junwang2023,
author = {Jun Wang},
title = {Awesome-LLM-Eval: a curated list of tools, benchmarks, demos, papers for Large Language Models Evaluation},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/onejune2018/Awesome-LLM-Eval}},
}
About the author:
- Head of algorithm R&D for the SFE platform
- Author of MPG (Molecular Pretraining GraphTransformer), a large model for drug discovery
- First place in international competitions: the SemEval 2022 idiom-identification task, MIT AI-Cure, VQA 2021, TREC 2021, and EAD 2019
- Homepage: https://onejune2018.github.io/homepage/