Maintained by WANG Yue (wangyue2714@gmail.com). Last updated on 2021/12/17.
Learning and Evaluating Contextual Embedding of Source Code, [code] ICML 2020 (CuBERT)
CodeBERT: A Pre-Trained Model for Programming and Natural Languages, [code] EMNLP 2020 Findings (CodeBERT)
GraphCodeBERT: Pre-training Code Representations with Data Flow, [code] ICLR 2021 (GraphCodeBERT)
Unified Pre-training for Program Understanding and Generation, [code] NAACL 2021 (PLBART)
Unsupervised Translation of Programming Languages, [code] NeurIPS 2020 (TransCoder)
Exploring Software Naturalness through Neural Language Models, arXiv 2020/06 (C-BERT)
PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers, EMNLP 2020 (PyMT5)
Contrastive Code Representation Learning, [code] arXiv 2020/07 (ContraCode)
DOBF: A Deobfuscation Pre-Training Objective for Programming Languages, arXiv 2021/02 (DOBF)
Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks, [code] ICSE 2021
CodeTrans: Towards Cracking the Language of Silicone’s Code Through Self-Supervised Deep Learning and High Performance Computing, [code] arXiv 2021/04 (CodeTrans)
How could Neural Networks understand Programs?, [code] ICML 2021 (OSCAR)
CoTexT: Multi-task Learning with Code-Text Transformer, arXiv 2021/05 (CoTexT)
Disentangled Code Representation Learning for Multiple Programming Languages, ACL 2021 Findings (CODEDISEN)
SYNCOBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation, arXiv 2021/09 (SYNCOBERT)
TreeBERT: A Tree-Based Pre-Trained Model for Programming Language, UAI 2021
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation, EMNLP 2021 (CodeT5) [code] [blog] [media] [slide] [poster]
Code Completion: Multi-task Learning based Pre-trained Language Model for Code Completion, ASE 2020 (CugLM)
Code Completion: IntelliCode Compose: Code Generation using Transformer, FSE 2020 (IntelliCode Compose)
Code Completion: Improving Code Autocompletion with Transfer Learning, arXiv 2021/05
Program Repair: Generating Bug-Fixes Using Pretrained Transformers, arXiv 2021/04 (DeepCode)
Program Repair: DeepDebug: Fixing Python Bugs Using Stack Traces, Backtranslation, and Code Skeletons, arXiv 2021/05 (DeepDebug)
Program Repair: TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer, ICML 2021
Program Repair: CURE: Code-Aware Neural Machine Translation for Automatic Program Repair, ICSE 2021
Unit Test Generation: Unit Test Case Generation with Transformers and Focal Context, arXiv 2021/05
Code Generation: Evaluating Large Language Models Trained on Code, arXiv 2021/07 (Codex)
Code Generation: Program Synthesis with Large Language Models, arXiv 2021/08
Language-Agnostic Representation Learning of Source Code from Structure and Context, [code] ICLR 2021 (Code Transformer)
GN-Transformer: Fusing AST and Source Code information in Graph Networks, OpenReview 2020/09 (GN-Transformer)
Program Repair: Hoppity: Learning Graph Transformations to Detect and Fix Bugs in Programs, ICLR 2020 (Hoppity)
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation, [code] arXiv 2021/02
Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks, [code] arXiv 2021/05
Measuring Coding Challenge Competence With APPS, arXiv 2021/05