Machine Learning for Software Engineering

This repository contains a curated list of papers, PhD theses, datasets, and tools that are devoted to research on Machine Learning for Software Engineering. The papers are organized into popular research areas so that researchers can find recent papers and state-of-the-art approaches easily.

Please feel free to send a pull request to add papers and relevant content that are not listed here.

Note: to quickly access this page, use ml4se.dev

Papers

Type Inference

Cross-Domain Evaluation of a Deep Learning-Based Type Inference System (2022), arxiv, Gruner, Bernd, et al. [pdf] [code]
Learning To Predict User-Defined Types (2022), TSE'22, Jesse, Keven, et al. [pdf]
Recovering Container Class Types in C++ Binaries (2022), CGO'22, Wang, Xudong, et al.
Finding the Dwarf: Recovering Precise Types from WebAssembly Binaries (2022), PLDI'22, Lehmann, Daniel and Pradel, Michael [pdf]
Type4Py: Practical Deep Similarity Learning-Based Type Inference for Python (2022), ICSE'22, Mir, Amir, et al. [pdf][code]
Static Inference Meets Deep Learning: A Hybrid Type Inference Approach for Python (2022), ICSE'22, Peng, Yun, et al. [pdf]
Type Inference as Optimization (2021), NeurIPS'21 AIPLANS, Pandi, Irene Vlassi, et al. [pdf]
SimTyper: Sound Type Inference for Ruby using Type Equality Prediction (2021), OOPSLA'21, Kazerounian, Milod, et al.
Learning type annotation: is big data enough? (2021), FSE 2021, Jesse, Kevin, et al. [pdf][code]
Cross-Lingual Adaptation for Type Inference (2021), arxiv 2021, Li, Zhiming, et al. [pdf]
PYInfer: Deep Learning Semantic Type Inference for Python Variables (2021), arxiv 2021, Cui, Siwei, et al. [pdf]
Advanced Graph-Based Deep Learning for Probabilistic Type Inference (2020), arxiv 2020, Ye, Fangke, et al. [pdf]
Typilus: Neural Type Hints (2020), PLDI 2020, Allamanis, Miltiadis, et al. [pdf][code]
LambdaNet: Probabilistic Type Inference using Graph Neural Networks (2020), arxiv 2020, Wei, Jiayi, et al. [pdf]
TypeWriter: Neural Type Prediction with Search-based Validation (2019), arxiv 2019, Pradel, Michael, et al. [pdf]
NL2Type: Inferring JavaScript Function Types from Natural Language Information (2019), ICSE 2019, Malik, Rabee S., et al. [pdf][code]
Deep Learning Type Inference (2018), ESEC/FSE 2018, Hellendoorn, Vincent J., et al. [pdf][code]
Python Probabilistic Type Inference with Natural Language Support (2016), FSE 2016, Xu, Zhaogui, et al.
Predicting Program Properties from “Big Code” (2015) ACM SIGPLAN 2015, Raychev, Veselin, et al. [pdf]

Code Completion

Boosting source code suggestion with self-supervised Transformer Gated Highway (2022), JSS, Hussain, Yasir, et al.
Syntax-Aware On-the-Fly Code Completion (2022), arxiv, Takerngsaksiri, W., et al. [pdf]
Learning to Prevent Profitless Neural Code Completion (2022), arxiv, Sun, Z., et al. [pdf]
All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs (2022), arxiv, Bibaev, Vitaliy, et al. [pdf]
CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences (2022), ICSE'22, Izadi, Maliheh, et al. [pdf] [code]
Code Completion by Modeling Flattened Abstract Syntax Trees as Graphs (2021), AAAI'21, Wang, Yanlin, et al. [pdf]
Code Prediction by Feeding Trees to Transformers (2021), ICSE'21, Kim, Seohyun, et al. [pdf]
Fast and Memory-Efficient Neural Code Completion (2020), arxiv 2020, Svyatkovskoy, Alexey, et al. [pdf]
Pythia: AI-assisted Code Completion System (2019), KDD'19, Svyatkovskiy, Alexey, et al. [pdf]
Code Completion with Neural Attention and Pointer Networks (2018), arxiv 2018, Li, Jian, et al. [pdf]

Code Generation

Execution-based Evaluation for Data Science Code Generation Models (2022), arxiv, Huang, Junjie, et al. [pdf]
Multi-lingual Evaluation of Code Generation Models (2022), arxiv, Athiwaratkun, Ben, et al. [pdf][code]
DocCoder: Generating Code by Retrieving and Reading Docs (2022), arxiv, Zhou, Shuyan, et al. [pdf]
Compilable Neural Code Generation with Compiler Feedback (2022), ACL'22, Wang, Xin, et al. [pdf]
T5QL: Taming language models for SQL generation (2022), arxiv, Arcadinho, S., et al. [pdf]
Incorporating Domain Knowledge through Task Augmentation for Front-End JavaScript Code Generation (2022), arxiv, Shen, Sijie, et al. [pdf]
Language Models Can Teach Themselves to Program Better (2022), arxiv, Haluptzok, Patrick, et al. [pdf]
DocCoder: Generating Code by Retrieving and Reading Docs (2022), arxiv, Zhou, Shuyan, et al. [pdf]
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning (2022), arxiv, Le, Hung, et al. [pdf]
Repository-Level Prompt Generation for Large Language Models of Code (2022), arxiv, Shrivastava, Disha, et al. [pdf]
CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation (2022), arxiv, Zan, Daoguang, et al. [pdf]
NatGen: Generative pre-training by “Naturalizing” source code (2022), FSE'22, Chakraborty, Saikat, et al. [pdf]
StructCoder: Structure-Aware Transformer for Code Generation (2022), arxiv, Tipirneni, Sindhu, et al. [pdf]
Compilable Neural Code Generation with Compiler Feedback (2022), arxiv 2022, Wang, Xin, et al. [pdf]
Predictive Synthesis of API-Centric Code (2022), arxiv 2022, Nam, Daye, et al. [pdf]
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation (2021), EMNLP'21, Wang, Yue, et al. [pdf]
Evaluating Large Language Models Trained on Code (2021), arxiv 2021, Chen, Mark, et al. [pdf] [code]
Code Prediction by Feeding Trees to Transformers (2020), arxiv 2020, Kim, Seohyun, et al. [pdf]
TreeGen: A Tree-Based Transformer Architecture for Code Generation (2019), arxiv 2019, Zhu, Qihao, et al. [pdf]
A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation (2017), arxiv 2017, Barone, Antonio V. M., et al. [pdf]

Code Summarization

ClassSum: a deep learning model for class-level code summarization (2022), Springer NCA, Li, Mingchen, et al. [code]
Boosting Code Summarization by Embedding Code Structures (2022), COLING'22, Son, Jikyoeng, et al. [pdf]
Low-Resources Project-Specific Code Summarization (2022), ASE'22, Xie, Rui, et al. [pdf]
Few-shot training LLMs for project-specific code-summarization (2022), arxiv, A., Toufique, and P. Devanbu. [pdf]
Are We Building on the Rock? On the Importance of Data Preprocessing for Code Summarization (2022), FSE'22, Shi, Lin, et al. [pdf]
Learning code summarization from a small and local dataset (2022), arxiv, Ahmed, Toufique, and Devanbu, P. [pdf]
Modeling Hierarchical Syntax Structure with Triplet Position for Source Code Summarization (2022), ACL'22, Guo, Juncai, et al. [pdf]
AST-Trans: Code Summarization with Efficient Tree-Structured Attention (2022), ICSE'22, Tang, Ze, et al. [pdf]
GypSum: Learning Hybrid Representations for Code Summarization (2022), ICPC'22, Wang, Yu, et al. [pdf]
M2TS: Multi-Scale Multi-Modal Approach Based on Transformer for Source Code Summarization (2022), ICPC'22, Gao, Yuexiu and Lyu, Chen [pdf]
Project-Level Encoding for Neural Source Code Summarization of Subroutines (2021), ICPC'21, Bansal, Aakash, et al. [pdf]
Code Structure Guided Transformer for Source Code Summarization (2021), arxiv 2021, Gao, Shuzheng, et al. [pdf]
Source Code Summarization Using Attention-Based Keyword Memory Networks (2020), IEEE BigComp 2020, Choi, YunSeok, et al.
A Transformer-based Approach for Source Code Summarization (2020), arxiv 2020, Ahmad, Wasi Uddin, et al. [pdf]
Learning to Represent Programs with Graphs (2018), ICLR'18, Allamanis, Miltiadis, et al. [pdf]
A Convolutional Attention Network for Extreme Summarization of Source Code (2016), ICML 2016, Allamanis, Miltiadis, et al. [pdf]

Code Embeddings/Representation

CLAWSAT: Towards Both Robust and Accurate Code Models (2022), arxiv, Jia, Jinghan, et al. [pdf]
sem2vec: Semantics-Aware Assembly Tracelet Embedding (2022), TSE, Wang, Huaijin, et al.
COMBO: Pre-Training Representations of Binary Code Using Contrastive Learning (2022), arxiv, Zhang, Yifan, et al. [pdf]
Soft-Labeled Contrastive Pre-training for Function-level Code Representation (2022), arxiv, Li, Xiaonan, et al. [pdf]
A Tree-structured Transformer for Program Representation Learning (2022), arxiv, Wang, Wenhan, et al. [pdf]
What does Transformer learn about source code? (2022), arxiv, Zhang, Kechi, et al. [pdf]
Test2Vec: An Execution Trace Embedding for Test Case Prioritization (2022), arxiv, Jabbar, Emad, et al. [pdf]
Diet Code is Healthy: Simplifying Programs for Pre-Trained Models of Code (2022), arxiv, Zhang, Zhaowei, et al. [pdf]
MetaTPTrans: A Meta Learning Approach for Multilingual Code Representation Learning (2022), arxiv, Pian, Weiguo, et al. [pdf]
Towards Learning (Dis)-Similarity of Source Code from Program Contrasts (2022), ACL'22, Ding, Yangruibo, et al. [pdf]
Towards Learning Generalizable Code Embeddings using Task-agnostic Graph Convolutional Networks (2022), TOSEM, Ding, Zishuo, et al.
Learning to Represent Programs with Code Hierarchies (2022), arxiv, Nguyen, Minh and Nghi DQ Bui, [pdf]
CV4Code: Sourcecode Understanding via Visual Code Representations (2022), arxiv, Shi, Ruibo, et al. [pdf]
Hyperbolic Representations of Source Code (2022), AAAI'22, Khan, Raiyan, et al. [pdf]
Unified Abstract Syntax Tree Representation Learning for Cross-Language Program Classification (2022), ICPC'22, Wang, Kesu, et al. [pdf]
Hierarchical Semantic-Aware Neural Code Representation (2022), JSS'22, Jiang, Yuan, et al.
CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training (2022), arxiv 2022, Wang, Xin, et al. [pdf]
Hierarchical Heterogeneous Graph Attention Network for Syntax-Aware Summarization (2022), AAAI'22, Song, Z., and King, I., [pdf]
Unleashing the Power of Compiler Intermediate Representation to Enhance Neural Program Embeddings (2022), ICSE'22, Li, Zongjie, et al. [pdf]
XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training (2022), TOSEM'22, Lin, Zehao, et al.
Fold2Vec: Towards a Statement Based Representation of Code for Code Comprehension (2022), TOSEM'22, Bertolotti, Francesco and Cazzola, Walter
HELoC: Hierarchical Contrastive Learning of Source Code Representation (2022), ICPC'22, Wang, Xiao, et al. [pdf]
Multi-View Graph Representation for Programming Language Processing: An Investigation into Algorithm Detection (2022), AAAI'22, Long, Tin et al. [pdf]
UniXcoder: Unified Cross-Modal Pre-training for Code Representation (2022), arxiv 2022, Guo, Daya, et al. [pdf]
SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations (2022), ICSE'22, Niu, Changan, et al. [pdf]
CoTexT: Multi-task Learning with Code-Text Transformer (2021), arxiv, Phan, Long, et al. [pdf]
TreeCaps: Tree-Based Capsule Networks for Source Code Processing (2021), AAAI'21, Bui, Nghi DQ, et al. [pdf] [code]
Language-Agnostic Representation Learning of Source Code from Structure and Context (2021), ICLR'21, Zügner, Daniel, et al. [pdf]
Learning and Evaluating Contextual Embedding of Source Code (2020), ICML 2020, Kanade, Aditya, et al. [pdf]
Learning Semantic Program Embeddings with Graph Interval Neural Network (2020), OOPSLA'20, Wang, Yu, et al.
Contrastive Code Representation Learning (2020), arxiv 2020, Jain, Paras, et al. [pdf]
Codebert: A Pre-trained Model for Programming and Natural Languages (2020), arxiv 2020, Feng, Zhangyin, et al. [pdf]
SCELMo: Source Code Embeddings from Language Models (2020), arxiv 2020, Karampatsis, Rafael-Michael, et al. [pdf]
code2vec: Learning Distributed Representations of Code (2019), ACM POPL 2019, Alon, Uri, et al. [pdf]
COSET: A Benchmark for Evaluating Neural Program Embeddings (2019), arxiv 2019, Wang, Ke, et al. [pdf]
A Literature Study of Embeddings on Source Code (2019), arxiv 2019, Chen, Zimin, et al. [pdf]
code2seq: Generating Sequences from Structured Representations of Code (2018), arxiv 2018, Alon, Uri, et al. [pdf]
Neural Code Comprehension: A Learnable Representation of Code Semantics (2018), NIPS 2018, Ben-Nun, Tal, et al. [pdf]
Convolutional Neural Networks over Tree Structures for Programming Language Processing (2016), AAAI'16, Mou, Lili, et al. [pdf]

Code Changes

Commit2Vec: Learning Distributed Representations of Code Changes (2021), SN Computer Science, Lozoya, Rocío Cabrera, et al.
CODIT: Code Editing with Tree-Based Neural Models (2020), TSE 2020, Chakraborty, Saikat, et al.
On learning meaningful code changes via neural machine translation (2019), ICSE 2019, Tufano, Michele, et al.

Bug/Vulnerability Detection

Variable-Based Fault Localization via Enhanced Decision Tree (2022), arxiv, Jiang, Jiajun, et al. [pdf]
SPVF: security property assisted vulnerability fixing via attention-based models (2022), EMSE, Zhou, Zhou, et al.
Modeling function-level interactions for file-level bug localization (2022), EMSE, Liang, H., et al.
Practical Automated Detection of Malicious npm Packages (2022), ICSE'22, Sejfia, A., and M. Schäfer [pdf]
Machine Learning for Source Code Vulnerability Detection: What Works and What Isn't There Yet (2022), IEEE Security & Privacy, Marjanov, Tina, et al.
Path-sensitive code embedding via contrastive learning for software vulnerability detection (2022), ISSTA'22, Cheng, Xiao, et al.
VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection (2022), arxiv 2022, Hanif, H. and Maffeis, S. [pdf]
Katana: Dual Slicing-Based Context for Learning Bug Fixes (2022), arxiv 2022, Sintaha, Mifta, et al. [pdf]
LineVul: A Transformer-based Line-Level Vulnerability Prediction (2022), MSR'22, Fu, M., & Tantithamthavorn, C. [pdf][code]
Transformer-Based Language Models for Software Vulnerability Detection: Performance, Model's Security and Platforms (2022), arxiv 2022, Thapa, Chandra, et al. [pdf]
LineVD: Statement-level Vulnerability Detection using Graph Neural Networks (2022), MSR'22, Hin, David, et al. [pdf]
Nalin: Learning from Runtime Behavior to Find Name-Value Inconsistencies in Jupyter Notebooks (2022), ICSE'22, Patra, Jibesh, et al. [pdf]
Hoppity: Learning graph transformations to detect and fix bugs in programs (2020), ICLR 2020, Dinella, Elizabeth, et al. [pdf]
Deep Learning based Software Defect Prediction (2020), Neurocomputing, Qiao, Lei, et al.
Software Vulnerability Discovery via Learning Multi-domain Knowledge Bases (2019), IEEE TDSC, Lin, Guanjun, et al.
Neural Bug Finding: A Study of Opportunities and Challenges (2019), arxiv 2019, Habib, Andrew, et al. [pdf]
Automated Vulnerability Detection in Source Code Using Deep Representation Learning (2018), ICMLA 2018, Russell, Rebecca, et al.
DeepBugs: A Learning Approach to Name-based Bug Detection (2018), ACM PL 2018, Pradel, Michael, et al. [pdf]
Automatically Learning Semantic Features for Defect Prediction (2016), ICSE 2016, Wang, Song, et al.

Source Code Modeling

Do Bugs Lead to Unnaturalness of Source Code? (2022), FSE'22, Jiang, Yanjie, et al.
On the Naturalness of Bytecode Instructions (2022), ASE'22, Choi, Y., and J. Nam. [pdf]
CodeBERT-nt: code naturalness via CodeBERT (2022), arxiv, Khanfir, Ahmed, et al. [pdf]
CommitBART: A Large Pre-trained Model for GitHub Commits (2022), arxiv, Liu, S., et al, [pdf]
Towards Learning (Dis)-Similarity of Source Code from Program Contrasts (2022), ACL'22, Ding, Yangruibo, et al. [pdf]
A Systematic Evaluation of Large Language Models of Code (2022), arxiv 2022, Xu, Frank F., et al. [pdf][code]
Multilingual training for Software Engineering (2022), ICSE'22, Ahmed, Toufique, et al. [pdf]
Unified Pre-training for Program Understanding and Generation (2021), NAACL'21, Ahmad, Wasi Uddin, et al. [pdf]
Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code (2020), ICSE'20, Karampatsis, Rafael-Michael, et al.
Maybe Deep Neural Networks are the Best Choice for Modeling Source Code (2019), arxiv 2019, Karampatsis, Rafael-Michael, et al. [pdf]
Are Deep Neural Networks the Best Choice for Modeling Source Code? (2017), FSE 2017, Hellendoorn, Vincent J., et al. [pdf]

Program Repair

- Practical Program Repair in the Era of Large Pre-trained Language Models (2022), arxiv, Xia, C. S. et al. [pdf]
SYNSHINE: improved fixing of Syntax Errors (2022), IEEE TSE, Ahmed, T. et al.
TransRepair: Context-aware Program Repair for Compilation Errors (2022), ASE'22, Li, Xueyang, et al. [pdf]
Repairing Bugs in Python Assignments Using Large Language Models (2022), arxiv, Zhang, Jialu, et al. [pdf]
Repair Is Nearly Generation: Multilingual Program Repair with LLMs (2022), arxiv, Joshi, Harshit, et al. [pdf]
VulRepair: A T5-Based Automated Software Vulnerability Repair (2022), FSE'22, Fu, Michael, et al. [pdf]
Less Training, More Repairing Please: Revisiting Automated Program Repair via Zero-shot Learning (2022), FSE'22, Xia, Chunqiu Steven, and Lingming Z. [pdf]
Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes (2022), arxiv, Richter, Cedric, and Heike W. [pdf]
AdaptivePaste: Code Adaptation through Learning Semantics-aware Variable Usage Representations (2022), arxiv 2022, Liu, Xiaoyu, et al. [pdf]
DEAR: A Novel Deep Learning-based Approach for Automated Program Repair (2022), ICSE'22, Li, Yi, et al. [pdf]
TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer (2021), ICML'21, Berabi, Berkay, et al. [pdf]
Neural Transfer Learning for Repairing Security Vulnerabilities in C Code (2021), Chen, Zimin, et al. [pdf]
Generating Bug-Fixes Using Pretrained Transformers (2021), arxiv 2021, Drain, Dawn, et al. [pdf]
Global Relational Models of Source Code (2020), ICLR'20, Hellendoorn, Vincent J., et al. [pdf]
Neural Program Repair by Jointly Learning to Localize and Repair (2019), arxiv 2019, Vasic, Marko, et al. [pdf]

Program Translation

The Effectiveness of Transformer Models for Analyzing Low-Level Programs (2022), MIT Primes, Zifan Guo [pdf]
Code Translation with Compiler Representations (2022), arxiv, Szafraniec, Marc, et al. [pdf]
BabelTower: Learning to Auto-parallelized Program Translation (2022), ICML'22, Wen, Yuanbo, et al. [pdf]
Multilingual Code Snippets Training for Program Translation (2022), AAAI'22, Zhu, Ming, et al. [pdf]
Semantics-Recovering Decompilation through Neural Machine Translation (2021), arxiv 2021, Liang, Ruigang, et al. [pdf]
Unsupervised Translation of Programming Languages (2020), arxiv 2020, Lachaux, Marie-Anne et al. [pdf]

Program Analysis

AutoPruner: Transformer-Based Call Graph Pruning (2022), FSE'22, Le-Cong, Thanh, et al. [pdf][code]
Striking a Balance: Pruning False-Positives from Static Call Graphs (2022), ICSE'22, Utture, Akshay, et al. [pdf][code]

Code Clone Detection

Efficient transformer with code token learner for code clone detection (2022), arxiv, Zhang, Aiping, et al.
Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection (2022), arxiv, Zubkov, Maksim, et al. [pdf]
Cross-Language Source Code Clone Detection Using Deep Learning with InferCode (2022), arxiv 2022, Yahya, M., and Kim, D., [pdf]
funcGNN: A Graph Neural Network Approach to Program Similarity (2020), ESEM'20, Nair, Aravind, et al. [pdf]
Cross-Language Clone Detection by Learning Over Abstract Syntax Trees (2019), MSR'19, Perez, Daniel, et al.
The Adverse Effects of Code Duplication in Machine Learning Models of Code (2019), Onward! 2019, Allamanis, Miltiadis, [pdf]

Code Search

You See What I Want You to See: Poisoning Vulnerabilities in Neural Code Search (2022), FSE'22, Wan, Yao, et al.
How to Better Utilize Code Graphs in Semantic Code Search? (2022), FSE'22, Shi, Yucen, et al.
Exploring Representation-Level Augmentation for Code Search (2022), EMNLP'22, Li, Haochen, et al. [pdf][code]
A code search engine for software ecosystems (2022), CEUR, Pfaff, Chris, et al. [pdf]
Cross-Domain Deep Code Search with Meta Learning (2022), ICSE'22, Chai, Yitian, et al. [pdf]

Empirical Studies

Are Neural Bug Detectors Comparable to Software Developers on Variable Misuse Bugs? (2022), ASE'22, Richter, Cedric, et al. [pdf]
Do Pre-trained Language Models Indeed Understand Software Engineering Tasks? (2022), arxiv, Li, Yao, et al. [pdf]
A large-scale empirical study of commit message generation: models, datasets and evaluation (2022), EMSE, Tao, Wei, et al.
Examining Zero-Shot Vulnerability Repair with Large Language Models (2022), IEEE SP, Pearce, H., et al.
Extracting Meaningful Attention on Source Code: An Empirical Study of Developer and Neural Model Code Exploration (2022), arxiv, Paltenghi, M., et al. [pdf]
SimSCOOD: Systematic Analysis of Out-of-Distribution Behavior of Source Code Models (2022), arxiv, Hajipour, H., et al. [pdf]
Are Neural Bug Detectors Comparable to Software Developers on Variable Misuse Bugs? (2022), ASE'22, Richter, Cedric, et al. [pdf]
Open Science in Software Engineering: A Study on Deep Learning-Based Vulnerability Detection (2022), TSE, Nong, Yu, et al. [pdf]
A controlled experiment of different code representations for learning-based program repair (2022), EMSE, Namavar, M., et al.
What is it like to program with artificial intelligence? (2022), arxiv, Sarkar, Advait, et al. [pdf]
Security Implications of Large Language Model Code Assistants: A User Study (2022), arxiv, Sandoval, Gustavo, et al. [pdf]
An Empirical Study of Code Smells in Transformer-based Code Generation Techniques (2022), arxiv, Siddiq, M. L. et al. [pdf]
No More Fine-Tuning? An Experimental Evaluation of Prompt Tuning in Code Intelligence (2022), FSE'22, Wang, Chaozheng, et al. [pdf]
Generating Realistic Vulnerabilities via Neural Code Editing: An Empirical Study (2022), FSE'22, Nong, Yu, et al. [pdf]
GitHub Copilot AI pair programmer: Asset or Liability? (2022), arxiv, Dakhel, Arghavan Moradi, et al. [pdf]
Evaluating the Impact of Source Code Parsers on ML4SE Models (2022), arxiv, Utkin, Ilya, et al [pdf]
An extensive study on pre-trained models for program understanding and generation (2022), ISSTA'22, Zeng, Zhengran, et al.
Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code (2022), arxiv, Bareiß, Patrick, et al. [pdf]
Assessing Project-Level Fine-Tuning of ML4SE Models (2022), arxiv, Bogomolov, Egor, et al. [pdf]
On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages (2022), ICPC'22, Chen, Fuxiang, et al. [pdf]
Learning Program Semantics with Code Representations: An Empirical Study (2022), SANER'22, Siow, Jing Kai, et al. [pdf][code]
Assessing Generalizability of CodeBERT (2021), ICSME'21, Zhou, Xin, et al.
Thinking Like a Developer? Comparing the Attention of Humans with Neural Models of Code (2021), ASE'21, Paltenghi, M. & Pradel, M.
An Empirical Study of Transformers for Source Code (2021), FSE'21, Chirkova, N., & Troshin, S.
An Empirical Study on the Usage of Transformer Models for Code Completion (2021), MSR'21, Ciniselli, Matteo, et al.

Surveys

Deep Learning Meets Software Engineering: A Survey on Pre-Trained Models of Source Code (2022), arxiv 2022, Niu, Changan, et al. [pdf]
A Survey of Deep Learning Models for Structural Code Understanding (2022), arxiv 2022, Wu, Ruoting, et al. [pdf]
A Survey on Machine Learning Techniques for Source Code Analysis (2021), arxiv 2021, Sharma, Tushar, et al. [pdf]
Deep Learning & Software Engineering: State of Research and Future Directions (2020), arxiv 2020, Devanbu, Prem, et al. [pdf]
A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research (2020), arxiv 2020, Watson, Cody, et al. [pdf]
Machine Learning for Software Engineering: A Systematic Mapping (2020), arxiv 2020, Shafiq, Saad, et al. [pdf]
Synergy between Machine/Deep Learning and Software Engineering: How Far Are We? (2020), arxiv 2020, Wang, Simin, et al. [pdf]
Software Engineering Meets Deep Learning: A Literature Review (2020), arxiv 2020, Ferreira, Fabio, et al. [pdf]
Software Vulnerability Detection Using Deep Neural Networks: A Survey (2020), Proceedings of the IEEE, Lin, Guanjun, et al.
Deep Learning for Source Code Modeling and Generation: Models, Applications and Challenges (2020), arxiv 2020, Le, Triet HM, et al. [pdf]
A Survey of Machine Learning for Big Code and Naturalness (2018), ACM Computing Surveys, Allamanis, Miltiadis, et al. [pdf]

Misc

Neural Language Models for Code Quality Identification (2022), arxiv, Sengamedu, S., et al.
Detecting Security Patches in Java Projects Using NLP Technology (2022), ICNLSP'22, Stefanoni, Andrea, et al. [pdf]
Program Merge Conflict Resolution via Neural Transformers (2022), FSE'22, Svyatkovskiy, Alexey, et al.
Automating code review activities by large-scale pre-training (2022), FSE'22, Li, Zhiyu, et al. [[pdf]]
Teaching Algorithmic Reasoning via In-context Learning (2022), arxiv, Zhou, Hattie, et al [pdf]
Improved Evaluation of Automatic Source Code Summarisation (2022), arxiv, Phillips, Jesse, et al. [pdf]
Towards Generalizable and Robust Text-to-SQL Parsing (2022), arxiv, Gao, Chang, et al. [pdf]
CodeEditor: Learning to Edit Source Code with Pre-trained Models (2022), arxiv, Li, Jia, et al. [pdf]
Poison Attack and Defense on Deep Source Code Processing Models (2022), arxiv, Li, Jia, et al. [pdf]
NEUDEP: Neural Binary Memory Dependence Analysis (2022), FSE'22, Pei, Kexin, et al. [pdf]
Novice Type Error Diagnosis with Natural Language Models (2022), arxiv, Geng, Chuqin, et al. [pdf]
CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure (2022), arxiv, Chen, Nuo, et al. [pdf]
Using Large Language Models to Enhance Programming Error Messages (2022), SIGCSE'22, Leinonen, J., et al. [pdf]
Automatic Code Documentation Generation Using GPT-3 (2022), ASE'22, Khan, J. Y., and G. Uddin. [pdf]
So Much in So Little: Creating Lightweight Embeddings of Python Libraries (2022), arxiv, Golubev, Yaroslav, et al. [pdf]
Code Compliance Assessment as a Learning Problem (2022), arxiv, Sawant, N., and S. H. Sengamedu [pdf]
Learning-based Identification of Coding Best Practices from Software Documentation (2022), ICSME'22, Sawant, N., and S. H. Sengamedu [pdf]
Learning to Answer Semantic Queries over Code (2022), arxiv, Sahu, Surya Prakash, et al. [pdf]
XFL: Naming Functions in Binaries with Extreme Multi-label Learning (2022), arxiv, Patrick-Evans, J., et al. [pdf]
SymLM: Predicting Function Names in Stripped Binaries via Context-Sensitive Execution-Aware Code Embeddings (2022), Jin, Xin, et al. [pdf]
Out of the BLEU: how should we assess quality of the Code Generation models? (2022), arxiv, Evtikhiev, Mikhail, et al. [pdf]
Compressing Pre-trained Models of Code into 3 MB (2022), arxiv, Shi, Jieke, et al. [pdf]
A Scalable and Extensible Approach to Benchmarking NL2Code for 18 Programming Languages (2022), arxiv, Cassano, Federico, et al. [pdf]
AUGER: Automatically Generating Review Comments with Pre-training Models (2022), FSE'22, Li, Lingwei, et al. [pdf]
Overwatch: Learning Patterns in Code Edit Sequences (2022), arxiv, Zhang, Yuhao, et al. [pdf]
Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing (2022), KDD'22, Wang, Lihan, et al. [pdf]
DIRE and its Data: Neural Decompiled Variable Renamings with Respect to Software Class (2022), TOSEM, Dramko, Luke, et al.
Making Python Code Idiomatic by Automatic Refactoring Non-Idiomatic Python Code with Pythonic Idioms (2022), arxiv, Zhang, Zejun, et al. [pdf]
DeepPERF: A Deep Learning-Based Approach For Improving Software Performance (2022), arxiv, Garg, Spandan, et al. [pdf]
CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code (2022), ICSE ’22 Companion, Eghbali, Aryaz, and Michael, P. [pdf]
Impact of Evaluation Methodologies on Code Summarization (2022), ACL, Nie, Pengyu, et al. [pdf]

PhD Theses

Improving Programming Productivity with Statistical Models (2022), Tam Nguyen [pdf]
Learning to Find Bugs in Programs and their Documentation (2021), Andrew Habib [pdf]
Machine Learning and the Science of Software Engineering (2020), Vincent Hellendoorn
Deep learning for compilers (2020), Christopher E. Cummins [pdf]
Deep Learning in Software Engineering (2020), Cody Watson [pdf]
Learning Code Transformations via Neural Machine Translation (2019), Michele Tufano [pdf]
Improving the Usability of Static Analysis Tools Using Machine Learning (2019), Ugur Koc [pdf]
Learning Natural Coding Conventions (2016), Miltiadis Allamanis [pdf]

Talks

Machine Learning for Software Engineering: AMA, MSR 2020 [video]
Understanding Source Code with Deep Learning, FOSDEM 2019 [video]

Datasets

CS1QA (2022) - A Dataset for Assisting Code-based Question Answering in an Introductory Programming Course
XLCoST (2022) - A Benchmark Dataset for Cross-lingual Code Intelligence
CodeS (2022) - CodeS: A Distribution Shift Benchmark Dataset for Source Code Learning
ManyTypes4TypeScript (2022) - Type prediction dataset for TypeScript
HumanEval - Program synthesis from code comments
GitHub Code (2022) - 115M LoC in 32 programming languages
D2A (2021) - A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis
CodeXGLUE (2021)
ogbg-code2 (2021)
ManyTypes4Py (2021) - Type prediction dataset for Python
CodeSearchNet (2020)
ManySStuBs4J (2019)
150k Python Dataset (2016)
150k Javascript Dataset (2016)
GitHub Java Corpus (2013)

Tools

Source Code Analysis & Processing

LibSA4Py - LibSA4Py: Light-weight static analysis for extracting type hints and features
LibCST - A concrete syntax tree parser library for Python
python-graphs - A static analysis library for computing graph representations of Python programs suitable for use with graph neural networks.
Semantic - Parsing, analyzing, and comparing source code across many languages
GraphGen4Code - A toolkit for creating code knowledge graphs based on WALA code analysis and extraction of documentation
Joern - Code analysis platform for C/C++/Java/Binary/Javascript/Python/Kotlin based on code property graphs
NaturalCC - An Open-Source Toolkit for Code Intelligence
Scalpel - The Python Static Analysis Framework
WALA - T.J. Watson Libraries for Analysis, with frontends for Java, Android, and JavaScript

Machine Learning

SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation
Hugging Face - Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Code de-duplication

CD4Py - Code De-Duplication for Python
Near-duplicate Source Code Detector

Misc

Utilities by the DPU team of Microsoft
A set of tools to work with Big Code - Fetching GitHub repos, tokenizers, embeddings and etc
cloc - Counts blank lines, comment lines, and physical lines of source code in many programming languages.

Research Groups

Software Engineering Research Group (SERG), Delft University of Technology
Secure, Reliable, and Intelligent Systems Lab (SRI), ETH Zurich
Software Lab (SOLA), University of Stuttgart
Machine Learning for the Analysis of Source Code Text (MAST), Edinburgh University
Deep Program Understanding, Microsoft Research
DECAL (Davis Excellent/Eclectic/Extreme Computational Analytics Lab), UC Davis
JetBrains Research
SMart software Analysis and Trustworthy computing Lab (SMAT), Monash University

Venues

ICSE, the International Conference on Software Engineering
FSE, Symposium on the Foundations of Software Engineering
ASE, the International Conference on Automated Software Engineering
MSR, the Mining Software Repositories conference
ICPC, the International Conference on Program Comprehension
ICLR, the International Conference on Learning Representations
ICML, the International Conference on Machine Learning
AAAI, the Association for the Advancement of Artificial Intelligence
ACL, the Association for Computational Linguistics
OOPSLA, the ACM Conference on Systems, Programming, Languages, and Applications
TSE, the IEEE Transactions on Software Engineering
TOSEM, ACM Transactions on Software Engineering and Methodology
JSS, Journal of Systems and Software

martinezmatias/ml4se

Machine Learning for Software Engineering

Content

Papers

Type Inference

Code Completion

Code Generation

Code Summarization

Code Embeddings/Representation

Code Changes

Bug/Vulnerability Detection

Source Code Modeling

Program Repair

Program Translation

Program Analysis

Code Clone Detection

Code Search

Empirical Studies

Surveys

Misc

PhD Theses

Talks

Datasets

Tools

Source Code Analysis & Processing

Machine Learning

Code de-duplication

Misc

Research Groups

Venues