OpenSource-LLMs-better-than-ChatGPT
A list of all reported open-source LLMs achieving a higher score than proprietary, paid OpenAI models (ChatGPT, GPT-4).
Datasets
Evaluation datasets
Logical reasoning (maths, coding, etc.)
Evaluating Large Language Models Trained on Code (HumanEval benchmark). Mark Chen et al, 2021. 164 hand-written programming problems. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem (see the illustrative example after this list).
Training Verifiers to Solve Math Word Problems (GSM8K benchmark). Karl Cobbe et al, 2021. 8.5K (7.5K training + 1K test) high-quality grade school math problems created by human problem writers. Each problem takes between 2 and 8 steps to solve, and solutions primarily involve a sequence of elementary calculations using basic arithmetic operations (see the illustrative example after this list).
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. Yushi Bai et al, 2023. Bilingual (English/Chinese), multi-task benchmark for long-context understanding: 21 datasets across 6 task categories, with an average length of 6,711 words (English) and 13,386 characters (Chinese). The categories cover key long-context tasks: single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion.
SummScreen: A Dataset for Abstractive Screenplay Summarization. Chen et al, ACL 2022. 22,503 episodes from TVMegaSite (SummScreen-TMS, split into 18,915/1,795/1,793 train/dev/test) and 4,348 episodes from ForeverDreaming (SummScreen-FD, split into 3,673/338/337 train/dev/test).
A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers (Qasper). Dasigi et al, NAACL 2021. 5,049 questions (2,593/1,005/1,451 train/valid/test) over 1,585 NLP papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text.
QuALITY: Question Answering with Long Input Texts, Yes!. Pang et al, NAACL 2022. Multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens. 6,737 questions split into 2,523/2,086/2,128 train/dev/test.
SQuAD: 100,000+ Questions for Machine Comprehension of Text. Rajpurkar et al, EMNLP 2016. 100k+ questions asked by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.
HellaSwag: Can a Machine Really Finish Your Sentence?. Zellers et al, ACL 2019. Questions collected with Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers.
WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Sakaguchi et al, 2019. Large-scale dataset of 44k problems, inspired by the original Winograd Schema Challenge (WSC), a commonsense reasoning benchmark of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. 12,282 instances are split into 9,248/1,267/1,767 train/dev/test sets.
SocialIQA: Commonsense Reasoning about Social Interactions (SIQA). Sap et al, EMNLP 2019. 38k (33,410/1,954/2,224 train/dev/test) multiple-choice commonsense questions along with correct and incorrect answers about social interactions collected through crowdsourcing.
Natural Questions: A Benchmark for Question Answering Research. Tom Kwiatkowski et al, TACL 2019. 307,373 training examples with single annotations; 7,830 development examples with 5-way annotations and 7,842 test examples with 5-way annotations. Questions are real anonymized, aggregated queries issued to the Google search engine. Each question is paired with an entire Wikipedia page.
TruthfulQA: Measuring How Models Mimic Human Falsehoods. Stephanie Lin et al, ACL 2022. 817 questions spanning 38 categories. Questions and answers are hand-written by human annotators and designed to elicit imitative falsehoods.
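To make the HumanEval format concrete, here is a minimal, made-up problem in the same style (a signature and docstring given as the prompt, a candidate completion, and unit tests). It is an illustration only, not an actual benchmark item, and the function name `running_max` is invented for this sketch.

```python
# Hypothetical HumanEval-style problem (illustrative, not an actual benchmark item).
# The prompt gives the signature and docstring; the model must generate the body,
# and unit tests then check functional correctness (pass@k).

def running_max(numbers: list) -> list:
    """Return a list whose i-th element is the maximum of numbers[:i + 1].

    >>> running_max([3, 1, 4, 1, 5])
    [3, 3, 4, 4, 5]
    """
    # Candidate completion produced by the model:
    result, current = [], float("-inf")
    for x in numbers:
        current = max(current, x)
        result.append(current)
    return result


# Unit tests of the kind used to score the completion:
assert running_max([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
assert running_max([2, 2, 2]) == [2, 2, 2]
assert running_max([-1, -5, 0]) == [-1, -1, 0]
```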
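Likewise, a made-up GSM8K-style word problem (not taken from the dataset), with its step-by-step arithmetic written out in Python:

```python
# Hypothetical GSM8K-style problem (illustrative, not an actual dataset item):
# "A bakery bakes 7 trays of muffins with 12 muffins per tray. By noon it has
#  sold all but 15 of them. How many muffins were sold by noon?"

muffins_per_tray = 12
trays = 7
total_baked = muffins_per_tray * trays   # Step 1: 12 * 7 = 84
sold_by_noon = total_baked - 15          # Step 2: 84 - 15 = 69
print(sold_by_noon)                      # Final answer: 69, reached in 2 elementary steps
```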
In the following, we report cases where an open-source LLM (e.g., Llama-2) outperforms a paid OpenAI LLM (e.g., ChatGPT). For conciseness, we follow these reporting guidelines:
- Only report the highest-performing version of the open-source LLM.
- Only report the highest-performing version of the OpenAI model that is outperformed by the open-source LLM.
- Average results over all datasets on which the open-source LLM beats the OpenAI LLM; this means excluding reported results on datasets where the proposed LLM underperforms all OpenAI LLMs (see the sketch after this list).
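To illustrate the averaging rule above, here is a minimal sketch with made-up dataset names and scores: only datasets on which the open-source LLM beats the OpenAI baseline contribute to the reported averages.

```python
# Hypothetical per-dataset scores (illustrative only).
open_source = {"GSM8K": 57.0, "HumanEval": 34.0, "TruthfulQA": 45.0}
openai_best = {"GSM8K": 52.0, "HumanEval": 48.0, "TruthfulQA": 40.0}

# Keep only the datasets where the open-source LLM outperforms the OpenAI model ...
wins = [d for d in open_source if open_source[d] > openai_best[d]]

# ... and average over that subset (HumanEval is excluded here).
avg_open_source = sum(open_source[d] for d in wins) / len(wins)
avg_openai = sum(openai_best[d] for d in wins) / len(wins)
print(wins, avg_open_source, avg_openai)  # ['GSM8K', 'TruthfulQA'] 51.0 46.0
```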
We refer the reader to the respective papers for more details.
We split LLMs depending on the type of training performed:
- Pre-training refers to LLMs pre-trained from scratch.
- Continual pre-training refers to LLMs initialized from an already pre-trained LLM (e.g., Llama-2) and then undergoing another phase of pre-training.
- Instruction tuning refers to LLMs trained with supervised fine-tuning on instruction-tuning datasets or standard downstream-task datasets.
- Inference designates techniques that improve LLM performance without changing the model weights.
Note that a proposed LLM may fall into several of the above 4 categories. In that case, we place it in the most computationally intensive category: for instance, a paper proposing both to continue pre-training Llama-2 and to fine-tune it on a new instruction-tuning dataset lands in the Continual pre-training category.