Primary language: Jupyter Notebook. License: Apache-2.0.

Self-instruct 🤗

A repository to perform self-instruct with a model on Hugging Face Hub.

What is this about?

This repository is dedicated to Self-instruct, an iterative approach that generates a dataset of instructions by bootstrapping on a model's own predictions. For it to work well, the model has to be powerful. The original work focuses on OpenAI's text-davinci-003 engine, which is one of their most powerful models. Our aim is to give modest, decoder-based models a chance to be used for data generation.

News

  • September 6, 2023: Get ready to welcome self-instruct data from Code Llama.
  • May 24, 2023: We've built a Space to visualize the data generated by self-instruct with StarCoder💫, the recent SOTA open-source code LLM by Hugging Face 🤗.

Disclaimer

  • Our approach requires a substantial amount of computational resources or access to an inference endpoint.
  • We will focus on the dataset generation pipeline and the curation rather than the fine-tuning.
  • Keep in mind that the quality of the dataset obtained by this method is strongly dependent on the quality of the model that is used.

Table of Contents

  1. Overview of the method
  2. Related work
  3. Our approach
  4. Quickstart
  5. Fine-tuning
  6. Acknowledgements

Overview

Self-instruct is an iterative method that helps language models (LMs) improve their ability to follow natural language instructions. The idea is to take a seed set of manually written instructions and use them to prompt the model to generate new instructions and their corresponding input-output instances. The method includes a filtering step to ensure the novelty of the generated tasks.
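
The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: `generate` stands in for a call to a language model, `is_novel` stands in for the filtering step, and every function and parameter name here is hypothetical.

```python
import random
from typing import Callable, List


def self_instruct_loop(
    seed_instructions: List[str],
    generate: Callable[[List[str]], List[str]],  # hypothetical wrapper around the model call
    is_novel: Callable[[str, List[str]], bool],  # hypothetical novelty filter
    num_to_generate: int,
    num_prompt_instructions: int = 3,
    max_rounds: int = 100,
    seed: int = 42,
) -> List[str]:
    """Grow a pool of instructions by repeatedly prompting a model with samples from it."""
    rng = random.Random(seed)
    pool = list(seed_instructions)
    generated: List[str] = []
    for _ in range(max_rounds):
        if len(generated) >= num_to_generate:
            break
        # Sample a few in-context examples from the current pool.
        examples = rng.sample(pool, min(num_prompt_instructions, len(pool)))
        for candidate in generate(examples):
            # Keep a candidate only if the filter judges it new with respect to the pool.
            if len(generated) < num_to_generate and is_novel(candidate, pool):
                pool.append(candidate)
                generated.append(candidate)
    return generated
```

Accepted candidates are added back to the pool, so later prompts can mix seed tasks with model-written ones. That feedback is what makes the procedure a bootstrap.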

Related work

Our implementation is inspired by the original Self-instruct method and by more recent variants, including Stanford's Alpaca and Code Alpaca. While the last two are almost identical, differing only in the set of seed tasks used, the original work takes a different approach. Self-instruct's author takes a set of seed tasks and prompts the model with some of them to make it generate instructions. The outputs for the generated instructions are then produced in a separate step. Conversely, Alpaca is all-in-one: the model is prompted to generate an instruction together with its input-output pair in a single pass. It uses the following template:

### Instruction:
{instruction}

### Input:
{input}

### Output:
{output}

The advantage of this all-in-one template is that it reduces the inference cost of the method, and there is no evidence that it significantly impairs the quality of the generated instances. Intuitively, we believe this prompting approach yields feasible instructions because each one must come with a sound input-output pair.
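
To make the all-in-one prompting concrete, here is a minimal sketch of how a few-shot prompt might be assembled from the template above. The way examples are joined and the trailing open header are assumptions made for illustration. `ALPACA_TEMPLATE` and `build_prompt` are hypothetical names, not identifiers from Alpaca or from this repository.

```python
from typing import Dict, List

# The three-field template shown above, written as a format string.
ALPACA_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Output:\n{output}"
)


def build_prompt(examples: List[Dict[str, str]]) -> str:
    """Concatenate solved examples, then open a new instruction block for the model to fill."""
    blocks = [ALPACA_TEMPLATE.format(**example) for example in examples]
    # The trailing header invites the model to write a new instruction,
    # followed by its input and output, in a single generation.
    blocks.append("### Instruction:\n")
    return "\n\n".join(blocks)


examples = [
    {"instruction": "Reverse the given string.", "input": "hello", "output": "olleh"},
]
print(build_prompt(examples))
```

A single completion of such a prompt yields the instruction, the input and the output together, which is where the saving in inference cost comes from.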

Our approach

Our approach is focused on code use cases, therefore our modifications are mostly relevant for that framework.

The prompting format

During our tests, we realized that, at least with "small" code models, the trigger words Input and Output tend to make the model generate test cases instead. This is a significant problem because, given an instruction, we want a working implementation rather than a potentially buggy test case. To alleviate this issue, we dropped the Input trigger word and adopted an instruction-output format.

The trigger words

Using Instruction, Input and Output seems to work well for text-davinci-003, but how well does it work for other models? The choice matters for small models in particular, as it can have a large impact on the quality of their generations. Following this intuition, our code makes it possible to change the trigger words used during prompting, so that they can be adapted to each model.

The post-processing

How should we select and post-process the instructions generated by prompting a model? In the original work, instructions are generated iteratively, and a new instruction is kept only if its ROUGE score with every previously generated instruction is strictly less than 0.7. This promotes diversity in the dataset, at least in terms of how the instructions are worded. According to our experiments, it is still possible to generate the same problem multiple times with a different formulation each time. We propose to take the curation further with several ideas.
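
The ROUGE-based filter can be illustrated with a dependency-free sketch. It assumes the ROUGE-L variant (F-measure over the longest common subsequence) and a simplified whitespace tokenization. A real pipeline would more likely rely on the rouge-score package, whose scores may differ slightly. The function names are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L F-measure on lowercased whitespace tokens (simplified tokenization)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_novel(candidate: str, pool: List[str], threshold: float = 0.7) -> bool:
    """True if the candidate scores strictly below the threshold against every pooled instruction."""
    return all(rouge_l(existing, candidate) < threshold for existing in pool)
```

Because the score is computed on surface tokens, two instructions that describe the same problem with different words can both pass, which is the limitation noted above.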

Self-consistency

We came up with a strong instruction filtering technique. The idea is simple: we want to test whether the model is consistent with what it generates. We verify this by prompting the model to generate an instruction from the output alone. This is a difficult task for both an LM and a human because in many cases it is unsolvable. When the model is able to generate an instruction, we compare its meaning with that of the ground-truth instruction. For that, we use Sentence-BERT, specifically all-MiniLM-L6-v2, with a similarity threshold of our choice (typically 0.5). This filtering technique is not recommended for models with a weak ability to understand natural language text.
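
The self-consistency check can be sketched as below. Two assumptions are made for illustration: `back_generate` is a hypothetical callable that asks the model for an instruction given only an output, and samples are kept when the similarity reaches the threshold. The token-overlap similarity is only a stand-in so that the sketch runs without downloading a model. In practice it would be replaced by the cosine similarity between Sentence-BERT embeddings.

```python
from typing import Callable, Dict, List


def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity for the demo: overlap between token sets, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def self_consistency_filter(
    samples: List[Dict[str, str]],
    back_generate: Callable[[str], str],  # hypothetical: output -> recovered instruction
    similarity: Callable[[str, str], float] = token_jaccard,
    threshold: float = 0.5,
) -> List[Dict[str, str]]:
    """Keep samples whose instruction can be recovered from their output."""
    kept = []
    for sample in samples:
        recovered = back_generate(sample["output"])
        # An empty string means the model could not propose an instruction.
        if recovered and similarity(sample["instruction"], recovered) >= threshold:
            kept.append(sample)
    return kept
```

A sample whose output does not match its instruction tends to yield a recovered instruction with a different meaning, so it falls below the threshold and is discarded.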

Uniqueness

Another option is to post-process the raw dataset by keeping only instructions that are not similar to each other in meaning. Once again, we use Sentence-BERT. An instruction is kept only if its similarity score with every previously generated instruction is below a threshold (typically 0.5).
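
A greedy version of this uniqueness filter might look like the sketch below. It compares each instruction against those already kept, which is an assumption about the exact procedure. It also uses a bag-of-words cosine similarity as a runnable stand-in for the cosine similarity between Sentence-BERT embeddings. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List


def bow_cosine(a: str, b: str) -> float:
    """Stand-in similarity: cosine between bag-of-words count vectors."""
    counts_a, counts_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keep_unique(
    instructions: List[str],
    similarity: Callable[[str, str], float] = bow_cosine,
    threshold: float = 0.5,
) -> List[str]:
    """Greedily keep instructions that stay below the threshold against all kept ones."""
    kept: List[str] = []
    for instruction in instructions:
        if all(similarity(instruction, previous) < threshold for previous in kept):
            kept.append(instruction)
    return kept
```

The result depends on the order of the input: the first instruction of a group of near-duplicates is the one that survives.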

Further details

We modified the seed tasks to keep only those related to code. To do so, we combined the tasks from Code Alpaca (code tasks extracted from the original seed tasks plus some new tasks probably created by the repo's author) with some LeetCode tasks. We have a total of 41 seed tasks.

Quickstart

StarCoder was trained on GitHub code, so it can be used to perform code generation. More precisely, the model can complete the implementation of a function or infer the following characters in a line of code. This can be done with the 🤗 transformers library.

Step by step installation with conda

Here, we present a step-by-step recipe that anybody can use to apply our self-instruct method to their preferred LLM in a conda environment. Create a new conda environment and activate it:

conda create -n env
conda activate env

Install the PyTorch version compatible with your version of CUDA. For example, the following command works with CUDA 11.6:

conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia

Install transformers and accelerate

conda install -c huggingface transformers 
pip install git+https://github.com/huggingface/accelerate

Do not forget to run accelerate config in the terminal to configure your environment. For more details, see the accelerate documentation.

We will also need rouge-score

pip install rouge-score

Now we are ready to clone the repository and to start working

git clone https://github.com/ArmelRandy/self-instruct
cd self-instruct

Instruction - output

Here we prompt the model with the following template

### Instruction :
{instruction}

### Output :
{output}

For instructions that come with an input (for example, the code in a debugging or translation task), we concatenate the instruction and the input under the keyword Instruction, which gives

### Instruction :
{instruction}
{input}

### Output :
{output}

It is possible to change the trigger words Instruction and Output into other words, such as Request and Answer respectively. However, the change has to be made directly in the code: you'll need to define a template and add it to template.py.
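
As an illustration of the two layouts above and of configurable trigger words, here is a minimal sketch. It does not reflect the actual contents of template.py. `format_example` and its parameters are hypothetical names chosen for this example.

```python
from typing import Optional


def format_example(
    instruction: str,
    output: str,
    input_text: Optional[str] = None,
    instruction_word: str = "Instruction",
    output_word: str = "Output",
) -> str:
    """Render one example in the instruction-output format with custom trigger words."""
    # When an input exists, it goes on the line right after the instruction,
    # under the same trigger word.
    body = instruction if not input_text else f"{instruction}\n{input_text}"
    return f"### {instruction_word} :\n{body}\n\n### {output_word} :\n{output}"


print(format_example("Fix the bug in this code.", "x = 1", input_text="x = = 1",
                     instruction_word="Request", output_word="Answer"))
```

Keeping the trigger words as parameters makes it cheap to try several pairs on a given model and compare the quality of the generations.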

accelerate launch main.py \
    --seed_tasks_path="data/code_tasks.jsonl" \
    --output_data_path data/output.jsonl \
    --num_instructions_to_generate 20000 \
    --template_name better \
    --format 2 \
    --model_name_or_path="bigcode/starcoderbase-1b" \
    --num_prompt_instructions 8 \
    --request_batch_size 8 \
    --num_prompt_synthetic_instructions 2 \
    --max_new_tokens 4096 \
    --temperature 0.8 \
    --top_p 0.95 \
    --num_beams 1 \
    --repetition_penalty 1.2 \
    --threshold 0.7 \
    --seed 42 \
    --keep_programming

Instruction - input - output

This is the template as designed in Stanford's Alpaca. The possibility to change the trigger words is also provided, with the same limitations as mentioned previously.

accelerate launch main.py \
    --seed_tasks_path="data/code_tasks.jsonl" \
    --output_data_path data/output.jsonl \
    --num_instructions_to_generate 20000 \
    --template_name better \
    --format 3 \
    --model_name_or_path="bigcode/starcoderbase-1b" \
    --num_prompt_instructions 8 \
    --request_batch_size 8 \
    --num_prompt_synthetic_instructions 2 \
    --max_new_tokens 4096 \
    --temperature 0.8 \
    --top_p 0.95 \
    --num_beams 1 \
    --repetition_penalty 1.2 \
    --threshold 0.7 \
    --seed 42 \
    --keep_programming

Text-generation-inference support

It is possible to use Text Generation Inference (TGI) for data generation if you have access to an inference endpoint. You'll need to set your Hugging Face token and the URL of your endpoint in the environment variables HF_TOKEN and API_URL. To use TGI, add --use_tgi to the above commands.
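
For readers who want to check their endpoint before launching a long run, the following sketch builds a request from the two environment variables using only the standard library. It is not how --use_tgi is implemented in this repository. The payload follows the "inputs" and "parameters" layout of TGI's generate request, and the shape of the response depends on the endpoint, so `generate` returns the decoded JSON as-is. The function names are hypothetical.

```python
import json
import os
import urllib.request


def build_tgi_request(
    prompt: str,
    max_new_tokens: int = 512,
    temperature: float = 0.8,
    top_p: float = 0.95,
    repetition_penalty: float = 1.2,
) -> urllib.request.Request:
    """Build a POST request for the endpoint named by API_URL, authenticated with HF_TOKEN."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "do_sample": True,
            "temperature": temperature,
            "top_p": top_p,
            "repetition_penalty": repetition_penalty,
        },
    }
    return urllib.request.Request(
        os.environ["API_URL"],
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def generate(prompt: str, **parameters) -> object:
    """Send the request and return the decoded JSON response, whatever its shape."""
    with urllib.request.urlopen(build_tgi_request(prompt, **parameters)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `generate("def add(a, b):", max_new_tokens=32)` with both variables set should return the endpoint's JSON response, or raise an HTTP error if the token or URL is wrong.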

Post-processing

This part has an additional requirement, sentence-transformers, which is installed as follows:

pip install -U sentence-transformers

Self-consistency

Here, we run the file processing.py with the help of accelerate:

accelerate launch processing.py \
    --seed_tasks_path="data/code_tasks.jsonl" \
    --input_data_path data/output.jsonl \
    --output_data_path data/output_processed.jsonl \
    --template_name default \
    --model_name_or_path="bigcode/starcoderbase-1b" \
    --num_prompt_instructions 4 \
    --num_trials 1 \
    --max_new_tokens 512 \
    --temperature 0.2 \
    --top_p 0.95 \
    --num_beams 1 \
    --repetition_penalty 1.2 \
    --threshold 0.7 \
    --seed 42

Uniqueness

Here we post-process the generated instructions by keeping only those that are not too similar to each other. To do so, from the root of the self-instruct folder, we run

cd post_processing
python unique_post_processing.py \
    --input_data_path ../data/output.jsonl \
    --output_data_path ../data/output_processed.jsonl \
    --threshold 0.5

Visualization and statistics

It is possible to visualize the generated instructions in terms of how they are phrased. Specifically, we can show the most common root verbs and their top 4 direct noun objects. This functionality is inherited from the implementation provided by self-instruct's author. Its usage requires additional libraries: spacy, benepar and plotly.

pip install -U spacy
python -m spacy download en_core_web_md
pip install benepar 
pip install plotly

Now, it is possible to run the notebook instruction_visualize.ipynb. We also provide dataset_to_hub.ipynb in order to push the generated dataset to the hub.

Fine-Tuning

Now that the dataset is available, we can fine-tune our favorite text/code LLM to make it follow instructions. Our natural choice is StarCoder. This repository gives a comprehensive method that can be used to fine-tune StarCoder on any instruction dataset available on the Hub. You can also check out OctoPack's repository.

Acknowledgements