
Training Open Instruction-following Language Models

This is the repository for the paper How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources.

We explore instruction-tuning popular base models on publicly available datasets. This repository contains:

  1. The training code used to train all of our models.
  2. The evaluation code for the evaluations done in the paper.
  3. Scripts for merging and creating model diffs.

As part of this work we introduce Tülu, a suite of LLaMa models fully finetuned on a strong mix of datasets!

Tülu 65B is the strongest model we built, and it is available here. See below for how to make use of this model yourself!

Setup

You can install the required packages by running the following command (after installing pytorch):

pip install -r requirements.txt

If you just want the dependencies for the weight diff script, use:

pip install -r weight-diff-requirements.txt
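After installation, you can sanity-check the environment with a short snippet. This is a minimal sketch; it assumes torch (installed separately, as noted above) and transformers (pinned in requirements.txt) are present:

```python
# Quick sanity check that the core dependencies are importable.
# Assumes torch (installed separately) and transformers (from requirements.txt).
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```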

Model preparation

To get the LLaMa checkpoints, please request access from Meta here, and consult the huggingface documentation for converting them to a huggingface-compatible format.

Generally, most huggingface-compatible models should work fine, potentially with some adjustments for different tokenizers, etc.
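For example, a converted checkpoint should load with the standard transformers API. A minimal sketch, where the local path is hypothetical:

```python
# Minimal sketch: load a converted, huggingface-compatible LLaMa checkpoint.
# "path/to/llama-7b-hf" is a hypothetical path to your converted model.
from transformers import AutoModelForCausalLM, AutoTokenizer

hf_llama_path = "path/to/llama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(hf_llama_path)
model = AutoModelForCausalLM.from_pretrained(hf_llama_path)
print(model.config.model_type)  # should report "llama"
```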

Weight Diff Script

We use a slightly modified form of the Alpaca weight diff script, which works in the same way.

To merge a model:

  1. Download the relevant LLaMa model and convert it to huggingface format (see above).
  2. Download our repository and install the right dependencies (see above).
  3. Download the model diff you want.
  4. Run the command below:
python scripts/weight_diff.py recover --path_raw ${hf_llama_path} --path_tuned ${output_path} --path_diff ${diff_location}
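Once recovered, the model at the --path_tuned location is a standard huggingface checkpoint. Below is a minimal generation sketch; the path is illustrative, and the <|user|>/<|assistant|> prompt format is an assumption you should verify against the format a given model was trained with:

```python
# Minimal sketch: generate with a recovered checkpoint.
# The path is the same one passed as --path_tuned above; the
# <|user|>/<|assistant|> prompt format is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

output_path = "path/to/recovered-model"
tokenizer = AutoTokenizer.from_pretrained(output_path)
model = AutoModelForCausalLM.from_pretrained(
    output_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "<|user|>\nWhat is instruction tuning?\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```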

Training

Dataset Preparation

To download and prepare the instruction datasets we explore, use:

./scripts/prepare_train_data.sh

Please check these datasets for licenses and restrictions around their use!

Finetuning

To run instruction tuning, you can use the following command:

./scripts/finetune_with_accelerate.sh

Adjust model_name_or_path, tokenizer_name, train_file, and output_dir to match your model, data, and setup. By default, this uses DeepSpeed with accelerate. A sketch of a possible train_file follows below.
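For reference, here is a hedged sketch of preparing a small train_file in a chat-style JSONL format. The field names ("dataset", "id", "messages") are assumptions about the schema; check the output of ./scripts/prepare_train_data.sh for the exact format the training code expects:

```python
# Hypothetical sketch: write a tiny chat-format JSONL train_file.
# Field names are assumptions - confirm against prepare_train_data.sh output.
import json

examples = [
    {
        "dataset": "my_dataset",
        "id": "example_0",
        "messages": [
            {"role": "user", "content": "What is instruction tuning?"},
            {"role": "assistant", "content": "Finetuning a language model on instruction-response pairs so that it follows natural-language instructions."},
        ],
    }
]

with open("my_train_file.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```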

Model Checkpoints

We provide a number of model checkpoints as weight diffs. You can find them all on huggingface here, and they are listed below:

| Model | 7B | 13B | 30B | 65B |
|---|---|---|---|---|
| SuperNI | link | link | | |
| CoT | link | link | | |
| Flan V2 | link | link | | |
| Dolly | link | link | | |
| Open Assistant 1 | link | link | | |
| ShareGPT | link | link | link | link |
| Self-instruct (original) | link | link | | |
| Unnatural Instructions | link | link | | |
| Alpaca | link | link | | |
| Code-Alpaca | link | link | | |
| GPT4-Alpaca | link | link | | |
| Baize | link | link | | |
| Human-Mix | link | link | link | link |
| Tulu | link | link | link | link |

Pythia and OPT models (and more...?) coming soon!

Evaluation

First, run the following script to download all the evaluation datasets:

./scripts/prepare_eval_data.sh

Evaluation scripts for the different datasets are located under ./scripts. For example, you can use the following command to run the MMLU evaluation:

./scripts/eval/mmlu.sh

AlpacaFarm

To run the AlpacaFarm evaluation, please make sure you have installed our fork of AlpacaFarm (https://github.com/hamishivi/alpaca_farm), and then use the following script:

python eval/alpaca_farm_eval.py --model <model> --batch_size 8

Please check the script itself for more details!

Human Evaluation Interface

Coming soon!

Citation

If you use this repository or our models, please cite our work:

@misc{wang2023far,
      title={How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources}, 
      author={Yizhong Wang and Hamish Ivison and Pradeep Dasigi and Jack Hessel and Tushar Khot and Khyathi Raghavi Chandu and David Wadden and Kelsey MacMillan and Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi},
      year={2023},
      eprint={2306.04751},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}