instruct-eval: A Python repository from henryqin1997

🐫 🍮 📚 InstructEval: Towards Holistic Evaluation of Instruction-Tuned Large Language Models

📣 Red-Eval, a benchmark for the safety evaluation of LLMs, has been added.

📣 Introducing Red-Eval, which evaluates the safety of LLMs using several jailbreaking prompts. With Red-Eval, GPT-4 could be jailbroken/red-teamed with a 65.1% attack success rate, and ChatGPT could be jailbroken 73% of the time, as measured on the DangerousQA and HarmfulQA benchmarks. More details are available in the code and paper.

📣 We developed Flacuna by fine-tuning Vicuna-13B on the Flan collection. Flacuna is better than Vicuna at problem-solving. Access the model at https://huggingface.co/declare-lab/flacuna-13b-v1.0.

📣 The InstructEval benchmark and leaderboard have been released.

📣 The paper evaluating instruction-tuned LLMs on the InstructEval benchmark suite has been released on arXiv. Read it here: https://arxiv.org/pdf/2306.04757.pdf

📣 We are releasing IMPACT, a dataset for evaluating the writing capability of LLMs in four aspects: Informative, Professional, Argumentative, and Creative. Download it from Hugging Face: https://huggingface.co/datasets/declare-lab/InstructEvalImpact.

📣 FLAN-T5 is also useful in text-to-audio generation. Find our work at https://github.com/declare-lab/tango if you are interested.

This repository contains code to evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. We aim to facilitate simple and convenient benchmarking across multiple tasks and models.

Why?

Instruction-tuned models such as Flan-T5 and Alpaca represent an exciting direction to approximate the performance of large language models (LLMs) like ChatGPT at lower cost. However, it is challenging to compare the performance of different models qualitatively. To evaluate how well the models generalize across a wide range of unseen and challenging tasks, we can use academic benchmarks such as MMLU and BBH. Compared to existing libraries such as evaluation-harness and HELM, this repo enables simple and convenient evaluation for multiple models. Notably, we support most models from HuggingFace Transformers 🤗 (check here for a list of models we support):

AutoModelForCausalLM (e.g., GPT-2, GPT-J, OPT-IML, BLOOMZ)
AutoModelForSeq2SeqLM (e.g., Flan-T5, Flan-UL2, TK-Instruct)
LlamaForCausalLM (e.g., LLaMA, Alpaca, Vicuna)
ChatGLM
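
The commands later in this README pass a `--model_name` value (`causal`, `seq_to_seq`, `llama`, `chatglm`) alongside a `--model_path`. A minimal sketch of how such a value might map to the class families listed above is shown here. The dictionary and the `resolve_family` helper are hypothetical names introduced for illustration; they are an assumption about the pairing, not the repository's actual code.

```python
# Hypothetical mapping from --model_name values to the class families
# listed above. This pairing is an assumption made for illustration.
MODEL_FAMILIES = {
    "causal": "AutoModelForCausalLM",
    "seq_to_seq": "AutoModelForSeq2SeqLM",
    "llama": "LlamaForCausalLM",
    "chatglm": "ChatGLM",
}


def resolve_family(model_name: str) -> str:
    """Return the class family for a --model_name value, or raise a clear error."""
    try:
        return MODEL_FAMILIES[model_name]
    except KeyError:
        supported = ", ".join(sorted(MODEL_FAMILIES))
        raise ValueError(
            f"Unknown model_name {model_name!r}; expected one of: {supported}"
        ) from None


print(resolve_family("seq_to_seq"))
```

Keeping the lookup in a single table makes an unsupported name fail early with a readable message rather than deep inside model loading.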

Results

For detailed results, please go to our leaderboard

Model Name	Model Path	Paper	Size	MMLU	BBH	DROP	HumanEval
	GPT-4	Link	?	86.4		80.9	67.0
	ChatGPT	Link	?	70.0		64.1	48.1
seq_to_seq	google/flan-t5-xxl	Link	11B	54.5	43.9
seq_to_seq	google/flan-t5-xl	Link	3B	49.2	40.2	56.3
llama	eachadea/vicuna-13b	Link	13B	49.7	37.1	32.9	15.2
llama	decapoda-research/llama-13b-hf	Link	13B	46.2	37.1	35.3	13.4
seq_to_seq	declare-lab/flan-alpaca-gpt4-xl	Link	3B	45.6	34.8
llama	TheBloke/koala-13B-HF	Link	13B	44.6	34.6	28.3	11.0
llama	chavinlo/alpaca-native	Link	7B	41.6	33.3	26.3	10.3
llama	TheBloke/wizardLM-7B-HF	Link	7B	36.4	32.9		15.2
chatglm	THUDM/chatglm-6b	Link	6B	36.1	31.3	44.2	3.1
llama	decapoda-research/llama-7b-hf	Link	7B	35.2	30.9	27.6	10.3
llama	wombat-7b-gpt4-delta	Link	7B	33.0	32.4		7.9
seq_to_seq	bigscience/mt0-xl	Link	3B	30.4
causal	facebook/opt-iml-max-1.3b	Link	1B	27.5			1.8
causal	OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5	Link	12B	27.0	30.0		9.1
causal	stabilityai/stablelm-base-alpha-7b	Link	7B	26.2			1.8
causal	databricks/dolly-v2-12b	Link	12B	25.7			7.9
causal	Salesforce/codegen-6B-mono	Link	6B				27.4
causal	togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1	Link	7B	38.1	31.3	24.7	5.5

Example Usage

Evaluate on Massive Multitask Language Understanding (MMLU), which includes exam questions from 57 tasks such as mathematics, history, law, and medicine. We use 5-shot direct prompting and measure the exact-match score.

python main.py mmlu --model_name llama --model_path chavinlo/alpaca-native
# 0.4163936761145136

python main.py mmlu --model_name seq_to_seq --model_path google/flan-t5-xl
# 0.49252243270189433
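
To make "5-shot direct prompting" concrete, the sketch below assembles a prompt from five solved demonstrations followed by one unanswered question, so the model is expected to emit the answer letter directly. The prompt template follows the format commonly used with MMLU; whether this repository uses exactly this wording is an assumption, and `format_example`, `build_prompt`, and the toy questions are hypothetical.

```python
from typing import List, Tuple

CHOICES = ["A", "B", "C", "D"]

# One demonstration: (question, four option texts, answer letter).
Example = Tuple[str, List[str], str]


def format_example(question: str, options: List[str], answer: str = "") -> str:
    """Render one question; leave the answer blank for the test question."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)


def build_prompt(subject: str, shots: List[Example], test_question: str,
                 test_options: List[str], k: int = 5) -> str:
    """Concatenate a header, up to k solved demonstrations, and the open question."""
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}.\n\n"
    )
    demos = [format_example(q, o, a) for q, o, a in shots[:k]]
    query = format_example(test_question, test_options)
    return header + "\n\n".join(demos + [query])


# Made-up placeholder data, not taken from MMLU.
shots = [(f"Toy question {i}?", ["w", "x", "y", "z"], "A") for i in range(1, 6)]
prompt = build_prompt("toy subject", shots, "Toy test question?", ["w", "x", "y", "z"])
print(prompt)
```

The prompt ends with a bare "Answer:" so that the next generated token can be compared against the gold letter.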

Evaluate on Big Bench Hard (BBH), which includes 23 challenging tasks on which PaLM (540B) performs below an average human rater. We use 3-shot direct prompting and measure the exact-match score.

python main.py bbh --model_name llama --model_path TheBloke/koala-13B-HF --load_8bit
# 0.3468942926723247

Evaluate on DROP, a reading-comprehension benchmark that requires discrete reasoning (such as arithmetic, counting, and sorting) over paragraphs. We use 3-shot direct prompting and measure the exact-match score.

python main.py drop --model_name seq_to_seq --model_path google/flan-t5-xl
# 0.5632458233890215
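
The exact-match score reported for MMLU, BBH, and DROP is the fraction of predictions that equal the reference answer. A minimal sketch follows; the normalization step (lowercasing, stripping punctuation and articles) is an assumption borrowed from common question-answering scoring scripts, and the repository may compare strings differently. `normalize` and `exact_match_score` are hypothetical names.

```python
import re
import string
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Note: stripping punctuation also removes decimal points, so "3.5"
    becomes "35"; a stricter scorer may want to handle numbers separately.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_score(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of predictions that match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)


# Made-up predictions and references for illustration.
score = exact_match_score(
    ["42", "The Eiffel Tower.", "B"],
    ["42", "eiffel tower", "C"],
)
print(score)  # two of three match
```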

Evaluate on HumanEval, which includes 164 coding questions in Python. We use 0-shot direct prompting and measure the pass@1 score.

python main.py humaneval --model_name llama --model_path eachadea/vicuna-13b --n_sample 1 --load_8bit
# {'pass@1': 0.1524390243902439}
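
For reference, pass@k is usually computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c is the number that pass the unit tests. The sketch below implements that estimator; it is a standalone illustration, not the repository's scoring code, and `pass_at_k` and `mean_pass_at_k` are hypothetical names. With one sample per problem (as with `--n_sample 1`), pass@1 reduces to the fraction of problems solved.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-problem estimate over all problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Illustration: 164 problems, one sample each, 25 of them solved.
correct_counts = [1] * 25 + [0] * 139
score = mean_pass_at_k(correct_counts, n=1, k=1)
print({"pass@1": round(score, 4)})
```

If the reported value above is a plain average over 164 problems, it corresponds to 25 solved problems, since 25 / 164 is approximately 0.1524.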

Setup

Install dependencies and download data.

conda create -n instruct-eval python=3.8 -y
conda activate instruct-eval
pip install -r requirements.txt
mkdir -p data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar
tar -xf data/mmlu.tar -C data && mv data/data data/mmlu
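
After extraction it can be useful to confirm that the data landed where the evaluation expects it. The sketch below counts CSV files per split under `data/mmlu`. The split folder names (`dev`, `val`, `test`) are an assumption about the archive layout, and `count_csvs` is a hypothetical helper, so adjust both if the extracted directory looks different.

```python
from pathlib import Path
from typing import Dict, Sequence


def count_csvs(root: str = "data/mmlu",
               splits: Sequence[str] = ("dev", "val", "test")) -> Dict[str, int]:
    """Count CSV files in each split folder; a missing folder counts as 0."""
    base = Path(root)
    counts = {}
    for split in splits:
        folder = base / split
        counts[split] = len(list(folder.glob("*.csv"))) if folder.is_dir() else 0
    return counts


if __name__ == "__main__":
    for split, n_files in count_csvs().items():
        status = "ok" if n_files else "missing or empty"
        print(f"{split}: {n_files} CSV files ({status})")
```

A count of zero for every split usually means the `mv data/data data/mmlu` step did not run or the archive was extracted elsewhere.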
