This repository provides an overview of all components from the paper OctoPack: Instruction Tuning Code Large Language Models.
Category | Name | Description |
---|---|---|
Data | CommitPack | 4TB of GitHub commits across 350 programming languages |
Data | CommitPackFT | Filtered version of CommitPack for high-quality commit messages that resemble instructions |
Model | OctoCoder | StarCoder (16B parameters) instruction tuned on CommitPackFT + OASST |
Model | OctoGeeX | CodeGeeX2 (6B parameters) instruction tuned on CommitPackFT + OASST |
Evaluation | HumanEvalPack | Extension of OpenAI's HumanEval to cover 3 scenarios across 6 languages |
CommitPack is uploaded here. To recreate:
- BigQuery SQL: Use BigQuery to select the commit data from the GitHub action data. All SQL commands can be found in `dataset/commitpack/sql`. They are executed in order, from the first to the fifth. They are separated and executed one by one because BigQuery raised `Resources exceeded` errors during query execution when all of them were run in a single statement. After each SQL query, a dataset is created and named as indicated in the filename. E.g. after executing `sql_1_commits_table_base.sql`, you would name the output dataset `commits_table_base`, which is then referenced in the 2nd statement.
- Export: From BigQuery, export the dataset created by the final SQL statement to a bucket inside GCP as parquet files.
- Upload to HF: Use a GCP compute instance to copy all the parquet files into a Hugging Face dataset and push it. The resulting dataset contains metadata on the commits, CommitPackMeta.
- Scrape GitHub: Run the script at `dataset/commitpack/scrape_github.py` to download the files before and after each git commit from GitHub. It contains some basic filters to remove noise files (relying on the extensions file at `dataset/commitpack/programming_languages.json`) and then uses multi-threading and multi-processing for scraping. It is recommended to run it on a very large instance.
- Shard (optional): Depending on the size of your files, you may want to shard them at this point using the script at `dataset/commitpack/shard.sh`.
- Opt-out & languages: Run the script at `dataset/commitpack/licenses_langs.py` to remove repositories from users who opted out of the step (first part with `__main__`, needs to be uncommented) and split the large files from the prior step into files for each programming language (second part with `__main__`, currently uncommented). You will likely have to change some of the path names and uncomment parts as necessary.
- Shard (optional): Using the script at `dataset/commitpack/shard.py`, you can shard the large jsonl files for each language into smaller chunks with a specified size limit.
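The final sharding step can be pictured with the sketch below. It is not the repository's `shard.py`; the function name, the output naming scheme, and the byte-based limit are assumptions made for illustration. It splits one JSONL file into numbered shards without ever breaking a line.

```python
from pathlib import Path


def shard_jsonl(src: Path, out_dir: Path, max_bytes: int) -> list[Path]:
    """Split `src` into numbered JSONL shards of at most `max_bytes` each.

    Lines are never split, so a single line larger than `max_bytes`
    ends up alone in its own (oversized) shard.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    shards: list[Path] = []
    out = None
    written = 0
    with src.open("rb") as f:
        for line in f:
            # Open a new shard at the start, or when this line would overflow
            # a shard that already holds at least one line.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                path = out_dir / f"{src.stem}_{len(shards):05d}.jsonl"
                shards.append(path)
                out = path.open("wb")
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return shards
```

For example, `shard_jsonl(Path("python.jsonl"), Path("shards"), max_bytes=500 * 1024**2)` would write shards of roughly 500 MB each; the file name and limit here are placeholders.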
CommitPackFT is uploaded here. To recreate:
- Prepare: Download CommitPack via e.g. `git clone https://huggingface.co/datasets/bigcode/commitpack` or follow all the steps above to recreate it.
- Filter: Run `python dataset/commitpackft/commitpackft_filters1.py` followed by `python dataset/commitpackft/commitpackft_filters2.py`. You may want to modify some of the global variables defined in the scripts.
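To give a feel for what filtering for "commit messages that resemble instructions" can look like, here is a deliberately small sketch. It does not reproduce the repository's filter scripts: the verb list, the word limits, and the `message` field name are all hypothetical stand-ins for the kind of global variables you might tune.

```python
import json

# Hypothetical settings; the actual filter scripts define their own.
IMPERATIVE_VERBS = {"add", "fix", "remove", "update", "use", "change", "make"}
MIN_WORDS, MAX_WORDS = 3, 50


def looks_like_instruction(message: str) -> bool:
    """Keep messages of moderate length that start with an imperative verb."""
    words = message.strip().split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False
    return words[0].lower() in IMPERATIVE_VERBS


def filter_jsonl_lines(lines, key="message"):
    """Yield parsed samples whose commit message passes the heuristic.

    `key` is an assumed field name for the commit message.
    """
    for line in lines:
        sample = json.loads(line)
        if looks_like_instruction(sample.get(key, "")):
            yield sample
```

A message such as "Fix typo in README" passes this heuristic, while "wip" or a merge message does not.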
- StarCoder Self-Instruct: Uploaded here, to recreate see this repository.
- xP3x: Uploaded here, to recreate see the script at `dataset/xp3x/filter_xp3x.py`.
- OASST: Uploaded here, to recreate see the script at `dataset/oasst/filter_oasst.py`. Each line in the jsonl file is a conversation tree. We only keep the first two messages of each conversation tree, which are the question and answer.
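The OASST step (keep only the first question and answer of each conversation tree) can be sketched as follows. This is not the repository's `filter_oasst.py`. It assumes each tree stores its root message under `prompt`, child messages under `replies`, and message bodies under `text`; the output keys `question` and `answer` are likewise illustrative.

```python
import json


def first_exchange(tree: dict):
    """Return (question, answer) from one conversation tree, or None.

    Assumed layout: tree["prompt"] is the root message, each message has
    a "text" field and an optional list of child messages in "replies".
    """
    root = tree.get("prompt") or {}
    replies = root.get("replies") or []
    if not root.get("text") or not replies:
        return None
    # Take the first listed reply; how to choose among several replies
    # is a decision this sketch does not try to reproduce.
    return root["text"], replies[0]["text"]


def convert(lines):
    """Turn JSONL lines (one tree per line) into question/answer records."""
    for line in lines:
        pair = first_exchange(json.loads(line))
        if pair is not None:
            yield {"question": pair[0], "answer": pair[1]}
```

Deeper turns in the tree are simply ignored, which matches the "first two messages" description above.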
- Setup: Run the bash code below to set up the evaluation repository. If you want the repository in exactly the state we used for the paper, you can add the flag `-b octopack` to clone the branch we used for the paper. Generally, we recommend using the latest version of the code.
```bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness
# If you want the exact paper branch: git clone -b octopack https://github.com/bigcode-project/bigcode-evaluation-harness
cd bigcode-evaluation-harness
pip install -q -r requirements.txt
accelerate config
```
- Run: You can then run a task via e.g.
```bash
accelerate launch main.py \
--model bigcode/octocoder \
--tasks humanevalfixtests-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt octocoder \
--save_generations_path generations_humanevalfixpython_octocoder.json \
--metric_output_path evaluation_humanevalfixpython_octocoder.json \
--max_length_generation 2048 \
--precision bf16
```
Notes:
- `accelerate`: You can also directly use `python main.py`. Accelerate has the advantage of automatically handling mixed precision & devices.
- `prompt`: This defines the prompt. Example values are `octocoder`, `octogeex`, `wizardcoder`, `instructcodet5p`, `starchat`, which use the prompting format put forth by the respective model creators. You can refer to the actual evaluation file to see what the prompt looks like.
- `allow_code_execution`: This will directly execute the evaluation and save results on your current machine. If you only want to create the generations and evaluate them later, you can add the flag `--generation_only` and then evaluate them using e.g. the Colab notebook we provide in the next section. This is practical for languages you may not have installed on your machine, such as Rust.
- `tasks`: For HumanEvalPack, the tasks are the following: `'humanevalfixdocs-cpp', 'humanevalfixdocs-go', 'humanevalfixdocs-java', 'humanevalfixdocs-js', 'humanevalfixdocs-python', 'humanevalfixdocs-rust', 'humanevalfixtests-cpp', 'humanevalfixtests-go', 'humanevalfixtests-java', 'humanevalfixtests-js', 'humanevalfixtests-python', 'humanevalfixtests-rust', 'humanevalexplaindescribe-cpp', 'humanevalexplaindescribe-go', 'humanevalexplaindescribe-java', 'humanevalexplaindescribe-js', 'humanevalexplaindescribe-python', 'humanevalexplaindescribe-rust', 'humanevalexplainsynthesize-cpp', 'humanevalexplainsynthesize-go', 'humanevalexplainsynthesize-java', 'humanevalexplainsynthesize-js', 'humanevalexplainsynthesize-python', 'humanevalexplainsynthesize-rust', 'humanevalsynthesize-cpp', 'humanevalsynthesize-go', 'humanevalsynthesize-java', 'humanevalsynthesize-js', 'humanevalsynthesize-python', 'humanevalsynthesize-rust'`.
  - HumanEvalFix is divided into two parts: one where only tests are provided and no docstrings (main focus of the paper) and one where docstrings are provided instead of tests as the source of truth (appendix).
  - HumanEvalExplain consists of describing first and then synthesizing given the descriptions. You need to run these tasks sequentially. For the describing part, you can activate `--generation_only` as there is no evaluation yet. For the synthesizing part, you need to provide the descriptions via `--load_data_path`, which will then be used to synthesize answers. `n_samples` is set to 1 for synthesis as we generate 1 answer for each description (multiple samples have already been generated for the descriptions via `n_samples`). See below for an example:
```bash
accelerate launch main.py \
--model bigcode/octocoder \
--tasks humanevalexplaindescribe-python \
--generation_only \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt octocoder \
--save_generations_path generations_humanevalexplaindescribepython_octocoder.json \
--max_length_generation 2048 \
--precision bf16

accelerate launch main.py \
--model bigcode/octocoder \
--tasks humanevalexplainsynthesize-python \
--do_sample True \
--temperature 0.2 \
--n_samples 1 \
--batch_size 1 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt octocoder \
--load_data_path generations_humanevalexplaindescribepython_octocoder.json \
--save_generations_path generations_humanevalexplainsynthesizepython_octocoder.json \
--metric_output_path evaluation_humanevalexplainpython_octocoder.json \
--max_length_generation 2048 \
--precision bf16
```
- HumanEvalSynthesize is an extension of HumanEval. If you would like to run with the original HumanEval prompt that relies on pure function continuation, you can use the flag `--prompt continue`. OctoCoder uses `--prompt octocoder` as shown in the script below, which should reproduce the pass@1 HumanEval score of 46.2% for OctoCoder:
```bash
accelerate launch main.py \
--model bigcode/octocoder \
--tasks humanevalsynthesize-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt octocoder \
--save_generations_path generations_humanevalsynthesizepython_octocoder.json \
--metric_output_path evaluation_humanevalsynthesizepython_octocoder.json \
--max_length_generation 2048 \
--precision bf16
```
- Unfortunately, there is some randomness depending on the Python version you use for evaluation and the `batch_size`. We use `batch_size=5` and Python 3.9.13.
- We provide the exact scripts we used in `evaluation/run/eval_scripts` for each model. There is also a `_range.sh` script for each task (e.g. `evaluation/run/eval_scripts/eval_humanevalfix_range.sh`), which runs each sample individually. This is much faster if you have multiple GPUs available. In the `_range.sh` scripts you need to specify the model and language you would like to run. After running it, you will have 164 generation files, which you need to merge with `python evaluation/run/merge_generations.py "generations_*json"`. Subsequently, you need to run the evaluation as explained in the next step.
- Evaluate: If you have only created generations without evaluating them (e.g. by adding the `--generation_only` flag or using the `_range.sh` scripts), you can use the notebook at `evaluation/run/humanevalpack_evaluation` or this colab to evaluate the generations. It contains a section for each programming language, where it first installs the language and then, given the path to your generations, evaluates them and provides you with the pass@k scores. We use the versions below for the evaluation of each language:
  - Python: `Python 3.9.13 torch 1.13.0+rocm5.2 accelerate 0.20.3 transformers 4.32.1` (for running & evaluating)
  - C++: `11.4.0` (but newer ones should be fine too)
  - JS: `js-md5@0.7.3`
  - Java: `java version "18" 2022-03-22`
  - Go: `go1.18.4`
  - Rust: `rustc 1.71.1 (eb26296b5 2023-08-03)`
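For readers unfamiliar with pass@k: it is commonly computed with the unbiased estimator introduced alongside HumanEval. With `n` samples for a problem, of which `c` pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below assumes this estimator; it is not taken from the evaluation notebook, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn from n) is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # in which case every draw of k samples contains a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over problems, given the number of passing samples each."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With `--n_samples 20` as in the commands above, a problem where 5 of the 20 generations pass contributes `pass_at_k(20, 5, 1) = 0.25` to pass@1.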
To create HumanEvalPack, we follow these steps:
- We use the upper commented out part of the script at `evaluation/create/prepare_humaneval.py` to create a JSON with the solution for each humaneval language in `evaluation/create/humaneval-x/data`.
- We then manually go through each JSON file (e.g. `evaluation/create/humaneval-x/data/cpp/data/humanevalpack.json`) to introduce a bug across all languages in parallel.
- We also make several fixes to the humaneval-x dataset, all of which are documented at the top of `evaluation/create/humaneval-x/README.md`.
- We run the lower part of `evaluation/create/prepare_humaneval.py` to turn the JSON files back into JSONL files with the buggy solution, an instruction column and some other metadata. These JSONL files, located at e.g. `evaluation/create/humaneval-x/data/cpp/data/humanevalpack.jsonl`, are then uploaded into the HF dataset at https://huggingface.co/datasets/bigcode/humanevalpack.
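The last step (JSON back to JSONL with an added instruction column) can be sketched as below. This is not the code in `prepare_humaneval.py`; the field name `entry_point`, the instruction wording, and the function name are assumptions chosen only to illustrate the transformation.

```python
import json


def json_to_jsonl_lines(samples, template="Fix bugs in {entry_point}."):
    """Yield one JSON line per sample, with an added "instruction" column.

    `samples` is the list loaded from a hand-edited JSON file; the
    `entry_point` field and the template text are hypothetical.
    """
    for sample in samples:
        row = dict(sample)  # copy so the input list is left untouched
        row["instruction"] = template.format(entry_point=sample["entry_point"])
        # json.dumps escapes embedded newlines, so each row stays on one line.
        yield json.dumps(row, ensure_ascii=False)
```

Writing the result is then a matter of `"\n".join(json_to_jsonl_lines(samples))` into the target `.jsonl` file.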
The finetuning script to create OctoCoder is at `finetuning/starcoder/finetune.py`. The folder contains a `README.md` with instructions.
OctoGeeX is finetuned based on CodeGeeX2-6B using an internal training framework. The hyperparameters are as follows:
Parameter | Value |
---|---|
`tp_size` | 2 |
`global_batch_size` | 48 |
`lr` | 5e-5 |
`train_step` | 50 |
`seq_length` | 8192 |
`precision` | bf16 |

It is also compatible with `finetuning/finetune.py`.
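As a rough sanity check on the scale of this run, the hyperparameters above imply an upper bound on the number of tokens processed. The calculation assumes that `global_batch_size` counts sequences per optimizer step and that every sequence is filled to `seq_length`; both are assumptions, since the internal framework is not described further.

```python
# Values copied from the hyperparameter table above.
GLOBAL_BATCH_SIZE = 48  # sequences per step (assumed meaning)
SEQ_LENGTH = 8192       # tokens per sequence
TRAIN_STEP = 50         # optimizer steps


def tokens_per_step(batch_size: int, seq_length: int) -> int:
    """Upper bound on tokens consumed in one optimizer step."""
    return batch_size * seq_length


def total_tokens(batch_size: int, seq_length: int, steps: int) -> int:
    """Upper bound on tokens consumed over the whole run."""
    return tokens_per_step(batch_size, seq_length) * steps


print(tokens_per_step(GLOBAL_BATCH_SIZE, SEQ_LENGTH))           # 393216
print(total_tokens(GLOBAL_BATCH_SIZE, SEQ_LENGTH, TRAIN_STEP))  # 19660800
```

Under these assumptions the run sees at most about 19.7M tokens; shorter or padded sequences would lower the number of real tokens.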
The finetuning script for santacoder is at `finetuning/santacoder/finetune.py`. The default hyperparameters are set for the `line diff` format, as described in Appendix H.
- Obtain Megatron-LM by executing `git clone https://github.com/bigcode-project/Megatron-LM`.
- Download the dataset: Download a pretraining dataset (commitpack-subset-cf) using `git clone https://huggingface.co/datasets/bigcode/commitpack-subset-cf`, and merge all jsonl files into one jsonl file. You can name it as you prefer, such as `commitpack_cf.jsonl`.
- Move the files `training/preprocess_santacoderpack.sh` and `training/pretraining_santacoderpack.sh` to the `Megatron-LM` directory.
- Tokenize the pretraining dataset by modifying `preprocess_santacoderpack.sh` to point to your jsonl file. Also, change the path of the tokenizer to point to StarCoder's `tokenizer.json`, which you can download using `wget https://huggingface.co/bigcode/starcoderbase/raw/main/tokenizer.json`. Finally, specify an output prefix where the tokenized data will be stored, and run the script using `bash preprocess_santacoderpack.sh`.
- Modify `pretraining_santacoderpack.sh` to adjust the `CHECKPOINT_PATH` so that it points to the saved Megatron-LM checkpoint, and set the `TOKENIZER_FILE` to StarCoder's `tokenizer.json`. Make sure to point to the correct environment and cache locations, and alter any custom settings to fit your setup. Run the script by executing `bash pretraining_santacoderpack.sh`.
- Convert the saved checkpoint using the script located at `convert_large.sh`. It contains instructions on which repos to download.
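The "merge all jsonl files into one jsonl file" step from the list above can be done in a few lines. The sketch below is one way to do it, not a script from the repository; the function name and the choice to walk subdirectories recursively are assumptions.

```python
from pathlib import Path


def merge_jsonl(src_dir: Path, dst: Path) -> int:
    """Concatenate every *.jsonl file under `src_dir` into `dst`.

    Returns the number of lines written. Blank lines are skipped and a
    trailing newline is added where a source file lacks one.
    """
    count = 0
    with dst.open("w", encoding="utf-8") as out:
        for path in sorted(src_dir.rglob("*.jsonl")):
            # Skip the output file itself if it lives inside src_dir.
            if path.resolve() == dst.resolve():
                continue
            with path.open(encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        out.write(line + "\n")
                        count += 1
    return count
```

For example, `merge_jsonl(Path("commitpack-subset-cf"), Path("commitpack_cf.jsonl"))` would produce the single file that `preprocess_santacoderpack.sh` is then pointed at.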
We did not end up using Megatron-LM fine-tuning for the model in the paper, but implemented it nevertheless. Feel free to follow these instructions to use it:
- Get the StarCoderBase Megatron-LM checkpoint: `git clone https://huggingface.co/bigcode/starcoderbase-megatron`
- Get Megatron-LM: `git clone -b mtf https://github.com/bigcode-project/Megatron-LM`
- Prepare a Python environment with PyTorch. (TODO: There may be some other packages needed that you will find out about when training fails)
- Prepare dataset: Prepare a finetuning dataset in the form of a single jsonl file with two keys: `inputs` & `outputs`. `inputs` should contain the prompt and instruction while `outputs` contains the targets. Loss will only be computed over `outputs`. See `dataset/commits_to_jsonl.py` for an example of doing this. In that example we put the instruction (commit message) in the target, but it's better to put it in the input.
- Tokenize the fine-tuning dataset by modifying `dataset/preprocess.sh` to point to your jsonl dataset. Also modify the path of the tokenizer, in our case pointing to StarCoder's `tokenizer.json` (`wget https://huggingface.co/bigcode/starcoderbase/raw/main/tokenizer.json`). Finally, specify an output prefix where the tokenized data will be stored. Then run it with `bash dataset/preprocess.sh`.
- Create two files `train_data_paths.txt.tmp` and `valid_data_paths.txt.tmp` that contain the paths to the above created tokenized dataset. For example, they could look like `"train: 1.0 0:0.95 output_prefix"` and `"valid: 1.0 0.95:1.0 output_prefix"`. In this case the dataset is split into 95% training and 5% validation. The first number is the weight of the dataset, the second number is the start of the dataset and the third number is the end of the dataset.
- Rename the downloaded checkpoint to `release`, i.e. `mv starcoderbase-megatron/iter* starcoderbase-megatron/release`, and create a file `starcoderbase-megatron/latest_checkpointed_iteration.txt` that contains simply `release` (`echo release > starcoderbase-megatron/latest_checkpointed_iteration.txt`).
- Modify `training/finetune_starcoderbase.sh` to adapt `CHECKPOINT_PATH` to point to the downloaded Megatron-LM checkpoint, `WEIGHTS_TRAIN` & `WEIGHTS_VALID` to point to the above created txt files, and `TOKENIZER_FILE` to StarCoder's `tokenizer.json`; point to your environment and cache locations, and modify the SBATCH settings to suit your setup. Then run it with `bash training/finetune_starcoderbase.sh`. You can interrupt and resume training. However, if you resume, you need to remove `--no_load_optim` and `--no_load_rng` from the command line arguments in the script so that the optimizer and random number generator state are loaded from the newly saved checkpoint (we only want to skip loading them from starcoderbase).
- Convert the saved checkpoint using the script at `convert_large.sh`. It contains instructions on which repos to download.
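The dataset preparation and data-path steps above can be sketched in Python as follows. The `inputs`/`outputs` keys and the path-line format come from the instructions above; the function names are illustrative, and keeping the surrounding double quotes in the path lines simply mirrors the example strings, which is an assumption about what the training script expects.

```python
import json
from pathlib import Path


def write_finetuning_jsonl(pairs, path) -> None:
    """Write (inputs, outputs) pairs as one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for inputs, outputs in pairs:
            f.write(json.dumps({"inputs": inputs, "outputs": outputs}) + "\n")


def data_path_lines(output_prefix: str, train_fraction: float = 0.95, weight: float = 1.0):
    """Build the train/valid lines: weight, start:end range, dataset prefix."""
    train = f'"train: {weight} 0:{train_fraction} {output_prefix}"'
    valid = f'"valid: {weight} {train_fraction}:1.0 {output_prefix}"'
    return train, valid


def write_data_path_files(directory, output_prefix: str) -> None:
    """Create train_data_paths.txt.tmp and valid_data_paths.txt.tmp."""
    train, valid = data_path_lines(output_prefix)
    Path(directory, "train_data_paths.txt.tmp").write_text(train + "\n", encoding="utf-8")
    Path(directory, "valid_data_paths.txt.tmp").write_text(valid + "\n", encoding="utf-8")
```

Following the advice above, the commit message (instruction) would go into `inputs` together with the code before the change, and the code after the change into `outputs`.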
Figures:
- Figure 1: `visuals/main.pdf`, create the main plot in `visuals/plots.ipynb` or via this colab and then add it to the correct tab in `visuals/visuals.drawio`, which can be opened with drawio
- Figure 2 (Upper): `visuals/distribution.pdf`, create via `visuals/plots.ipynb` or colab
- Figure 2 (Lower): `visuals/tasks.pdf`, create via `visuals/distribution_tasks.py`
- Figure 3: `visuals/humanevalpack.pdf`, create via `visuals/visuals.drawio`, which can be opened with drawio
- Figure 4: `visuals/ablations.pdf`, create via `visuals/plots.ipynb` or this colab
- Other Figures: Manual

Tables:
- Table 4: Create via `visual/distribution_languages.py`
- Other Tables: Manual
Everything is licensed as permissively as we are able to.
CommitPack, CommitPackFT, HumanEvalPack, and all code are licensed under the MIT License of this repository. Note that each sample within CommitPack and CommitPackFT has its own license corresponding to the repository it stems from, as indicated by the `license` field. All samples stem from permissively licensed repositories. You can check the paper appendix for the licenses we filtered for.
OctoCoder is licensed under the same license as StarCoder (Commercial except for use cases deemed harmful).
OctoGeeX is licensed under the same license as CodeGeeX2 (Commercial but a form needs to be submitted).
```bibtex
@article{muennighoff2023octopack,
  title={OctoPack: Instruction Tuning Code Large Language Models},
  author={Niklas Muennighoff and Qian Liu and Armel Zebaze and Qinkai Zheng and Binyuan Hui and Terry Yue Zhuo and Swayam Singh and Xiangru Tang and Leandro von Werra and Shayne Longpre},
  journal={arXiv preprint arXiv:2308.07124},
  year={2023}
}
```