CodeGen is an open-source model for program synthesis, trained on TPU-v4 and competitive with OpenAI Codex.

Primary language: Python. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

CodeGen

Official release of the CodeGen1 and CodeGen2 models (350M, 1B, 3B, 7B, 16B) for program synthesis by Salesforce AI.

Publications

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis
Erik Nijkamp*, Bo Pang*, Hiroaki Hayashi*, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong
ICLR, 2023

CodeGen2: Lessons for Training LLMs on Programming and Natural Languages
Erik Nijkamp*, Hiroaki Hayashi*, Caiming Xiong, Silvio Savarese, and Yingbo Zhou
ICLR, 2023

Usage

The models are available on the Hugging Face Hub.

CodeGen1

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-mono")
# Encode a comment as the prompt and generate up to 128 tokens (prompt included).
inputs = tokenizer("# this function prints hello world", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
# Decode the output, cutting it before the first match of any listed pattern.
print(tokenizer.decode(sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))
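The truncate_before_pattern argument in the decode call above trims the generated text at the first point where the model appears to have moved past the requested function. The following is a minimal sketch of that idea in plain Python, not the library's implementation: the helper name truncate_before_patterns and the sample completion string are hypothetical, and the real tokenizer may apply additional heuristics.

```python
import re


def truncate_before_patterns(text, patterns):
    """Cut `text` just before the earliest match of any pattern.

    Hypothetical helper for illustration; it only approximates what the
    tokenizer's truncate_before_pattern option does.
    """
    # MULTILINE makes ^ match at the start of every line, not only at the
    # start of the whole string.
    compiled = [re.compile(p, re.MULTILINE) for p in patterns]
    matches = (c.search(text) for c in compiled)
    starts = [m.start() for m in matches if m]
    return text[: min(starts)] if starts else text


# Made-up completion: a function followed by blank lines and unrelated text.
completion = 'def hello():\n    print("Hello, world!")\n\n\n\n# unrelated trailing text\n'
patterns = [r"\n\n^#", "^'''", "\n\n\n"]

truncated = truncate_before_patterns(completion, patterns)
print(truncated)
```

Here the three-newline pattern matches first, so everything from the run of blank lines onward is dropped and only the function body remains. Text that matches none of the patterns is returned unchanged.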

CodeGen2

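from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen2-7B")
# trust_remote_code=True lets transformers run the model code shipped in the checkpoint repository.
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen2-7B", trust_remote_code=True, revision="main")
# Encode a comment as the prompt and generate up to 128 tokens (prompt included).
inputs = tokenizer("# this function prints hello world", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
# Decode the output, cutting it before the first match of any listed pattern.
print(tokenizer.decode(sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))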

Training

The Jaxformer library for data pre-processing, training and fine-tuning the CodeGen models can be found here:

https://github.com/salesforce/jaxformer

Citation

If you find our code or papers useful, please cite them:

@article{nijkamp2022codegen,
  title={CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis},
  author={Nijkamp, Erik and Pang, Bo and Hayashi, Hiroaki and Tu, Lifu and Wang, Huan and Zhou, Yingbo and Savarese, Silvio and Xiong, Caiming},
  journal={ICLR},
  year={2023}
}

@article{nijkamp2023codegen2,
  title={CodeGen2: Lessons for Training LLMs on Programming and Natural Languages},
  author={Nijkamp, Erik and Hayashi, Hiroaki and Xiong, Caiming and Savarese, Silvio and Zhou, Yingbo},
  journal={ICLR},
  year={2023}
}