
Advancing LLM with Diverse Coding Capabilities


WaveCoder: Widespread And Versatile Enhanced Code LLM

[📜 Paper][🤗 HF Models][🐱 GitHub]
[🐦 Twitter][💬 Reddit][🍀 Unofficial Blog]

Repo for "WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation" [ACL 2024 Main]

Figure 1: WaveCoder models pipeline.

🔥 News

  • [2024/05/16] The WaveCoder paper was accepted to the main conference of ACL 2024.
  • [2024/04/10] 🔥🔥🔥 WaveCoder repo, models released at 🤗 HuggingFace!
  • [2023/12/26] WaveCoder paper released.

💡 Introduction

WaveCoder 🌊 is a series of large language models (LLMs) for the coding domain, designed to solve code-related problems through instruction-following learning. Its training dataset was generated from a subset of the code-search-net data using our proposed LLM-based generator-discriminator framework, and covers four general code-related tasks: code generation, code summarization, code translation, and code repair.

Model                  HumanEval  MBPP(500)  HumanEval Fix (Avg.)  HumanEval Explain (Avg.)
GPT-4                  85.4       -          47.8                  52.1
WaveCoder-DS-6.7B      65.8       63.0       49.5                  40.8
WaveCoder-Pro-6.7B     74.4       63.4       52.1                  43.0
WaveCoder-Ultra-6.7B   79.9       64.6       52.3                  45.7

LLM-based Generator-Discriminator

Figure 2: Main framework of LLM-based Generator-Discriminator.
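The control flow of a generator-discriminator pipeline can be sketched as follows. This is a minimal illustration, not the repository's implementation: `Example`, `build_dataset`, `toy_generate`, and `toy_discriminate` are hypothetical names, and the two toy functions are stand-ins for what would be LLM calls in the real framework.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# The four task types named in the introduction.
TASKS = ("code generation", "code summarization", "code translation", "code repair")


@dataclass
class Example:
    task: str
    instruction: str
    output: str


def build_dataset(
    raw_snippets: List[str],
    generate: Callable[[str], Optional[Example]],
    discriminate: Callable[[Example], bool],
) -> List[Example]:
    """Generate one candidate per raw snippet and keep those the discriminator accepts."""
    kept: List[Example] = []
    for snippet in raw_snippets:
        candidate = generate(snippet)
        if candidate is not None and discriminate(candidate):
            kept.append(candidate)
    return kept


def toy_generate(snippet: str) -> Optional[Example]:
    """Stand-in for the LLM generator (assumption): builds a fake summarization example."""
    code = snippet.strip()
    if not code:
        return None  # nothing to build an instruction from
    first_line = code.splitlines()[0]
    return Example("code summarization", f"Summarize this code:\n{code}", first_line)


def toy_discriminate(example: Example, min_chars: int = 10) -> bool:
    """Stand-in for the LLM discriminator (assumption): rejects unknown tasks and tiny outputs."""
    return example.task in TASKS and len(example.output) >= min_chars


if __name__ == "__main__":
    snippets = ["def add(a, b):\n    return a + b", "x", "   "]
    for ex in build_dataset(snippets, toy_generate, toy_discriminate):
        print(ex.task, "|", ex.output)
```

Passing the generator and discriminator in as callables keeps the filtering loop independent of whichever model or prompt sits behind them.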

Example of Instruction Generation

Figure 3: An Example of Our Data Generation.

Data Decontamination

We combine our dataset with the decontaminated evol-codealpaca-v1 dataset (WaveCoder-evol-instruct) to train WaveCoder-Ultra-6.7B.
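Decontamination generally means dropping training samples that overlap with benchmark problems. The sketch below uses a generic word-level n-gram overlap check. The README does not state which method was used for evol-codealpaca-v1, so the technique, the function names, and the default `n` are all assumptions for illustration.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_texts: Iterable[str], benchmark_texts: Iterable[str], n: int = 8) -> List[str]:
    """Keep only training texts that share no word n-gram with any benchmark text."""
    benchmark_ngrams: Set[Tuple[str, ...]] = set()
    for text in benchmark_texts:
        benchmark_ngrams |= ngrams(text, n)
    return [t for t in train_texts if not (ngrams(t, n) & benchmark_ngrams)]


if __name__ == "__main__":
    train = [
        "write a function that adds two numbers",
        "reverse a linked list in place",
    ]
    benchmark = ["Write a function that returns the sum"]
    # With n=3 the first sample shares ("write", "a", "function") and is dropped.
    print(decontaminate(train, benchmark, n=3))
```

A smaller `n` removes more aggressively and a larger `n` only catches near-verbatim copies, so the value is worth tuning against a few known leaks.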

🚀 Quick Start

⚙️ Setup

We recommend using Conda to manage your environment. Run the following commands to set up your environment:

conda create -n wavecoder python=3.9
conda activate wavecoder
cd src
pip install -r requirements.txt
pip install transformers==4.34.1
pip install flash-attn==2.5.5
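After installation, a short script can confirm that the pinned versions above are the ones actually present in the environment. This is an optional sanity check rather than part of the repository; `PINNED`, `installed_version`, and `find_mismatches` are hypothetical names, and the comparison is an exact string match, so a wheel with a local version suffix would be reported as a mismatch.

```python
from importlib import metadata
from typing import Callable, Dict, List

# Versions pinned by the setup commands above.
PINNED: Dict[str, str] = {"transformers": "4.34.1", "flash-attn": "2.5.5"}


def installed_version(package: str) -> str:
    """Return the installed version of a distribution, or a marker if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def find_mismatches(
    pinned: Dict[str, str],
    lookup: Callable[[str], str] = installed_version,
) -> List[str]:
    """Describe every package whose installed version differs from its pin."""
    problems: List[str] = []
    for package, wanted in pinned.items():
        found = lookup(package)
        if found != wanted:
            problems.append(f"{package}: expected {wanted}, found {found}")
    return problems


if __name__ == "__main__":
    for line in find_mismatches(PINNED) or ["all pinned packages match"]:
        print(line)
```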

⚡️ Training

We also open-source our complete training scripts for the community, and you may construct your own dataset for training. Our training scripts are based on FastChat.

To train a model, run the following command:

cd src
bash script/train.sh

⚖️ Evaluation

  • For the HumanEval benchmark, we use the code base from EvalPlus. We recommend using the code base from Magicoder with the following command to reproduce the HumanEval result of WaveCoder.

SAVE_PATH=evalplus-$(basename $MODEL)-$DATASET.jsonl
SANITIZED_PATH=humaneval_result/evalplus-$(basename $MODEL)-$DATASET-sanitized.jsonl

python -m experiments.text2code \
  --model_key $MODEL_KEY \
  --model_name_or_path $MODEL \
  --save_path $SAVE_PATH \
  --dataset $DATASET \
  --temperature 0.0 \
  --top_p 1.0 \
  --max_new_tokens 512 \
  --n_problems_per_batch 28 \
  --n_samples_per_problem 1 \
  --n_batches 1

echo "$MODEL"
evalplus.evaluate --dataset $DATASET --samples $SAVE_PATH

  • For MBPP (500), you can get generations by running the following command:

cd src
bash script/generate.sh

Then get the pass@k score and the error type analysis by running the following command:

bash script/evaluate.sh
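For reference, the pass@k metric (written pass_k above) is usually computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c is the number that pass the tests. The sketch below implements that standard formula. Whether `script/evaluate.sh` computes it in exactly this way is an assumption, and the function names are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Three problems with 10 samples each and 3, 0, and 10 correct samples.
    print(mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1))
```

With greedy decoding and a single sample per problem, as in the HumanEval command above, pass@1 reduces to the fraction of problems solved.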

🌲 Data Generation

First, prepare your raw code data and save it as a .jsonl file. Then run the following command:

cd src
bash script/coreset.sh

to get the coreset of your raw data. Once you have the coreset, you can run

cd src
bash script/data_generate.sh

to launch the LLM-based Generator-Discriminator framework. You can customize your data by controlling the prompt and the configurations in the above .sh script.
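Coreset selection picks a small, diverse subset of the raw data before generation. One common approach is k-center greedy selection over embedding vectors, sketched below in pure Python. Treat this as an assumption about the general technique rather than a description of what `script/coreset.sh` does; the function name and the toy 2-D points (standing in for real code embeddings) are hypothetical.

```python
from math import dist
from typing import List, Sequence


def k_center_greedy(points: Sequence[Sequence[float]], k: int, first: int = 0) -> List[int]:
    """Return indices of k points, greedily maximizing distance to the chosen set."""
    if not 0 < k <= len(points):
        raise ValueError("k must satisfy 0 < k <= len(points)")
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    nearest = [dist(p, points[first]) for p in points]
    while len(selected) < k:
        # Pick the point that is currently farthest from all selected centers.
        farthest = max(range(len(points)), key=nearest.__getitem__)
        selected.append(farthest)
        nearest = [min(d, dist(p, points[farthest])) for d, p in zip(nearest, points)]
    return selected


if __name__ == "__main__":
    # Two tight clusters plus one outlier; k=3 picks one point from each region.
    embeddings = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0), (5.0, 5.0)]
    print(k_center_greedy(embeddings, k=3))
```

Each iteration costs one pass over the data, so the whole selection is O(n·k) distance computations, which stays manageable for moderately sized corpora.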

📖 License

This code repository is licensed under the MIT License. Use of the DeepSeek Coder models is subject to their own license.

☕️ Citation

If you find this repository helpful, please consider citing our paper:

@article{yu2023wavecoder,
  title={Wavecoder: Widespread and versatile enhanced instruction tuning with refined data generation},
  author={Yu, Zhaojian and Zhang, Xin and Shang, Ning and Huang, Yangyu and Xu, Can and Zhao, Yishujie and Hu, Wenxiang and Yin, Qiufeng},
  journal={arXiv preprint arXiv:2312.14187},
  year={2023}
}

🍀 Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

