🐱 KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

KodCode is the largest fully synthetic open-source dataset providing verifiable solutions and tests for coding tasks. It contains 12 distinct subsets spanning various domains (from algorithmic to package-specific knowledge) and difficulty levels (from basic coding exercises to interview and competitive programming challenges). KodCode is designed for both supervised fine-tuning (SFT) and reinforcement learning (RL) tuning.

🕸️ Project Website - Discover the reasoning behind the name KodCode 🤨
📄 Technical Report - Discover the methodology and technical details behind KodCode
💾 GitHub Repo - Access the complete pipeline used to produce KodCode V1
🤗 HF Datasets: KodCode-V1 (for RL); KodCode-V1-SFT-R1 (for SFT)

Overview

Features

The KodCode pipeline generates diverse, challenging, and verifiable synthetic datasets for coding tasks. Key features include:

Diverse Sources: Generates high-quality coding questions from multiple sources, including zero-shot generation, human-written assessment questions, code snippets, and technical documentation - all unified in a single framework!
Self-Verification: Generates verifiable solutions and tests for each coding question. Supports pytest and parallel execution.
Style Converter: Easily converts coding questions between different styles.
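To make the Self-Verification idea concrete, here is a minimal sketch of checking a generated solution against generated tests. It is not the repository's implementation: the `verify` function and the sample strings are hypothetical, and plain `exec` is used as a dependency-free stand-in for pytest. It also assumes the solution and tests share one namespace, whereas real pytest files would normally import the solution from a module.

```python
def verify(solution_code: str, test_code: str) -> bool:
    """Return True if every test_* function in test_code passes
    against solution_code. Hypothetical helper, not part of KodCode."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate solution
        exec(test_code, namespace)      # define the test functions
        tests = [
            func for name, func in namespace.items()
            if name.startswith("test_") and callable(func)
        ]
        for test in tests:
            test()                      # a failed assert raises AssertionError
    except Exception:
        return False
    return bool(tests)                  # no tests found counts as unverified


# Illustrative (made-up) sample: one correct and one buggy solution.
solution = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"
test_code = "def test_add():\n    assert add(2, 3) == 5\n"

print(verify(solution, test_code))  # True
print(verify(buggy, test_code))     # False
```

Running untrusted generated code with `exec` in the current process is unsafe outside a sketch; isolating execution is a separate concern.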

Installation

Build environment

Conda Environment:

git clone https://github.com/KodCode-AI/kodcode.git
cd kodcode
conda create -n kodcode python=3.10 -y
conda activate kodcode
pip install -r requirements.txt

To run unit tests in parallel, you also need to install GNU parallel. For example, on Ubuntu:

sudo apt-get install parallel
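For readers who want to see what parallel test execution amounts to, the following sketch runs several independent test snippets concurrently, each in its own interpreter process. It uses Python's standard library as a stand-in for GNU parallel and is not how the repository invokes its tests; `run_snippet`, `run_all`, the worker count, and the timeout are all illustrative assumptions.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_snippet(code: str, timeout: float = 10.0) -> bool:
    """Run one snippet in a fresh interpreter; True means exit code 0."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a hung test as a failure
    return proc.returncode == 0


def run_all(snippets: list[str], workers: int = 4) -> list[bool]:
    """Run snippets concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_snippet, snippets))


if __name__ == "__main__":
    results = run_all(["assert 1 + 1 == 2", "assert 1 + 1 == 3"])
    print(results)  # [True, False]
```

Threads are sufficient here because each worker only waits on a child process, so the heavy lifting happens outside the Python interpreter running the pool.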

Generate KodCode

Please refer to the pipeline for more details.

TODO

One-line command to generate KodCode
Integrate the test pipeline (i.e., pytest) into verl
Implement sandbox execution for unit tests
Filter KodCode-Small with 50K samples

🧐 Other Information

License: Please follow CC BY-NC 4.0.

Contact: Please contact Zhangchen by email.

📚 Citation

If you find the model, data, or code useful, please cite:

@article{xu2024kodcode,
  title={KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding},
  author={Zhangchen Xu and Yang Liu and Yueqin Yin and Mingyuan Zhou and Radha Poovendran},
}