Awesome Machine Learning On Source Code
A curated list of awesome machine learning frameworks and algorithms that work on top of source code. Inspired by Awesome Machine Learning.
Contents
- Digests
- Conferences
- Papers
- Program Synthesis and Induction
- Source Code Analysis and Language modeling
- Neural Network Architectures and Algorithms
- Program Translation
- Code Suggestion and Completion
- Program Repair and Bug Detection
- APIs and Code Mining
- Code Optimization
- Topic Modeling
- Code Summarization
- Clone Detection
- Differentiable Interpreters
- Related research
(links require "Related research" spoiler to be open)
- Posts
- Talks
- Software
- Datasets
- Credits
- Contributions
- License
Digests
Conferences
- Workshop on NLP for Software Engineering
- SysML
- Mining Software Repositories
- AIFORSE
- source{d} tech talks
- NIPS Neural Abstract Machines and Program Induction workshop
Papers
Program Synthesis and Induction
- NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System - Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, Michael D. Ernst, 2018.
- Recent Advances in Neural Program Synthesis - Neel Kant, 2018.
- Neural Sketch Learning for Conditional Program Generation - Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, Chris Jermaine, 2018.
- Neural Program Search: Solving Programming Tasks from Description and Examples - Illia Polosukhin, Alexander Skidanov, 2018.
- Neural Program Synthesis with Priority Queue Training - Daniel A. Abolafia, Mohammad Norouzi, Quoc V. Le, 2018.
- Towards Synthesizing Complex Programs from Input-Output Examples - Xinyun Chen, Chang Liu, Dawn Song, 2018.
- SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning - Xiaojun Xu, Chang Liu, Dawn Song, 2017.
- Learning to Select Examples for Program Synthesis - Yewen Pu, Zachery Miranda, Armando Solar-Lezama, Leslie Pack Kaelbling, 2017.
- Neural Program Meta-Induction - Jacob Devlin, Rudy Bunel, Rishabh Singh, Matthew Hausknecht, Pushmeet Kohli, 2017.
- Glass-Box Program Synthesis: A Machine Learning Approach - Konstantina Christakopoulou, Adam Tauman Kalai, 2017.
- Learning to Infer Graphics Programs from Hand-Drawn Images - Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, Joshua B. Tenenbaum, 2017.
- Neural Attribute Machines for Program Generation - Matthew Amodio, Swarat Chaudhuri, Thomas Reps, 2017.
- Abstract Syntax Networks for Code Generation and Semantic Parsing - Maxim Rabinovich, Mitchell Stern, Dan Klein, 2017.
- Making Neural Programming Architectures Generalize via Recursion - Jonathon Cai, Richard Shin, Dawn Song, 2017.
- A Syntactic Neural Model for General-Purpose Code Generation - Pengcheng Yin, Graham Neubig, 2017.
- Program Synthesis from Natural Language Using Recurrent Neural Networks - Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Luke Zettlemoyer, Michael Ernst, 2017.
- RobustFill: Neural Program Learning under Noisy I/O - Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, Pushmeet Kohli, 2017.
- Lifelong Perceptual Programming By Example - Gaunt, Alexander L., Marc Brockschmidt, Nate Kushman, and Daniel Tarlow, 2017.
- Neural Programming by Example - Chengxun Shu, Hongyu Zhang, 2017.
- DeepCoder: Learning to Write Programs - Balog, Matej, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow, 2017.
- A Differentiable Approach to Inductive Logic Programming - Yang, Fan, Zhilin Yang, and William W. Cohen, 2017.
- Latent Attention For If-Then Program Synthesis - Xinyun Chen, Chang Liu, Richard Shin, Dawn Song, Mingcheng Chen, 2016.
- Latent Predictor Networks for Code Generation - Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom, 2016.
- Meta-Interpretive Learning of Efficient Logic Programs - Cropper, Andrew, and Stephen H. Muggleton, 2016.
- Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision (Short Version) - Liang, Chen, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao, 2016.
- Programs as Black-Box Explanations - Singh, Sameer, Marco Tulio Ribeiro, and Carlos Guestrin, 2016.
- Structured Generative Models of Natural Source Code - Chris J. Maddison, Daniel Tarlow, 2014.
Source Code Analysis and Language modeling
- A Survey of Machine Learning for Big Code and Naturalness - Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, Charles Sutton, 2017.
- Learning to Represent Programs with Graphs - Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi, 2017.
- A deep language model for software code - Hoa Khanh Dam, Truyen Tran, Trang Pham, 2016.
- Suggesting Accurate Method and Class Names - Miltiadis Allamanis, Earl T. Barr, Christian Bird, Charles Sutton, 2015.
- Mining Source Code Repositories at Massive Scale using Language Modeling - Miltiadis Allamanis, Charles Sutton, 2013.
Neural Network Architectures and Algorithms
- Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks - Nghi D. Q. Bui, Lingxiao Jiang, Yijun Yu, 2017.
- Syntax-Directed Variational Autoencoder for Structured Data - Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, Le Song, 2018.
- Divide and Conquer with Neural Networks - Nowak, Alex, and Joan Bruna, 2017.
- Learning Efficient Algorithms with Hierarchical Attentive Memory - Andrychowicz, Marcin, and Karol Kurach, 2016.
- Learning Operations on a Stack with Neural Turing Machines - Deleu, Tristan, and Joseph Dureau, 2016.
- Probabilistic Neural Programs - Murray, Kenton W., and Jayant Krishnamurthy, 2016.
- Learning Latent Multiscale Structure Using Recurrent Neural Networks - Chung, Junyoung, Sungjin Ahn, and Yoshua Bengio, 2016.
- Neural Programmer: Inducing Latent Programs with Gradient Descent - Neelakantan, Arvind, Quoc V. Le, and Ilya Sutskever, 2016.
- Neural Programmer-Interpreters - Reed, Scott, and Nando de Freitas, 2016.
- Neural GPUs Learn Algorithms - Kaiser, Łukasz, and Ilya Sutskever, 2016.
- Neural Random-Access Machines - Karol Kurach, Marcin Andrychowicz, Ilya Sutskever, 2016.
- Learning to Execute - Wojciech Zaremba, Ilya Sutskever, 2015.
- Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets - Joulin, Armand, and Tomas Mikolov, 2015.
- Neural Turing Machines - Graves, Alex, Greg Wayne, and Ivo Danihelka, 2014.
- From Machine Learning to Machine Reasoning - Bottou, Leon, 2011.
Program Translation
- Tree-to-tree Neural Networks for Program Translation - Xinyun Chen, Chang Liu, Dawn Song, 2018.
- Code Attention: Translating Code to Comments by Exploiting Domain Features - Wenhao Zheng, Hong-Yu Zhou, Ming Li, Jianxin Wu, 2017.
- Automatically Generating Commit Messages from Diffs using Neural Machine Translation - Siyuan Jiang, Ameer Armaly, Collin McMillan, 2017.
- A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation - Antonio Valerio Miceli Barone, Rico Sennrich, 2017.
- A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes - Pablo Loyola, Edison Marrese-Taylor, Yutaka Matsuo, 2017.
Code Suggestion and Completion
- Code Completion with Neural Attention and Pointer Networks - Jian Li, Yue Wang, Irwin King, Michael R. Lyu, 2017.
- Learning Python Code Suggestion with a Sparse Pointer Network - Avishkar Bhoopchand, Tim Rocktäschel, Earl Barr, Sebastian Riedel, 2016.
- Code Completion with Statistical Language Models - Veselin Raychev, Martin Vechev, Eran Yahav, 2014.
Program Repair and Bug Detection
- Dynamic Neural Program Embedding for Program Repair - Ke Wang, Rishabh Singh, Zhendong Su, 2018.
- To Type or Not to Type: Quantifying Detectable Bugs in JavaScript - Zheng Gao, Christian Bird, Earl Barr, 2017.
- Semantic Code Repair using Neuro-Symbolic Transformation Networks - Jacob Devlin, Jonathan Uesato, Rishabh Singh, Pushmeet Kohli, 2017.
- Automated Identification of Security Issues from Commit Messages and Bug Reports - Yaqin Zhou and Asankhaya Sharma, 2017.
- SmartPaste: Learning to Adapt Source Code - Miltiadis Allamanis, Marc Brockschmidt, 2017.
- End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks - Min-je Choi, Sehun Jeong, Hakjoo Oh, Jaegul Choo, 2017.
- Tailored Mutants Fit Bugs Better - Miltiadis Allamanis, Earl T. Barr, René Just, Charles Sutton, 2016.
APIs and Code Mining
- DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning - Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim, 2017.
- Deep API Learning - Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim, 2017.
- API usage pattern recommendation for software development - Haoran Niu, Iman Keivanloo, Ying Zou, 2017.
- Exploring API Embedding for API Usages and Applications - Nguyen, Nguyen, Phan and Nguyen, 2017.
- Parameter-Free Probabilistic API Mining across GitHub - Jaroslav Fowkes, Charles Sutton, 2016.
- A Subsequence Interleaving Model for Sequential Pattern Mining - Jaroslav Fowkes, Charles Sutton, 2016.
- Lean GHTorrent: GitHub data on demand - Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik, Andy Zaidman, 2014.
- Mining idioms from source code - Miltiadis Allamanis, Charles Sutton, 2014.
- The GHTorent Dataset and Tool Suite - Georgios Gousios, 2013.
Code Optimization
- The Case for Learned Index Structures - Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis, 2017.
- Learning to superoptimize programs - Rudy Bunel, Alban Desmaison, M. Pawan Kumar, Philip H.S. Torr, Pushmeet Kohlim 2017.
- Neural Nets Can Learn Function Type Signatures From Binaries - Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang, 2017.
- Adaptive Neural Compilation - Rudy Bunel, Alban Desmaison, Pushmeet Kohli, Philip H.S. Torr, M. Pawan Kumar, 2016.
- Learning to Superoptimize Programs - Workshop Version - Bunel, Rudy, Alban Desmaison, M. Pawan Kumar, Philip H. S. Torr, and Pushmeet Kohli, 2016.
Topic Modeling
- Topic modeling of public repositories at scale using names in source code - Vadim Markovtsev, Eiso Kant, 2017.
- Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code - Miltiadis Allamanis, Charles Sutton, 2013.
- Semantic clustering: Identifying topics in source code - Adrian Kuhn, Stéphane Ducasse, Tudor Girba, 2007.
Code Summarization
- A Convolutional Attention Network for Extreme Summarization of Source Code - Miltiadis Allamanis, Hao Peng, Charles Sutton, 2016.
- TASSAL: Autofolding for Source Code Summarization - Jaroslav Fowkes, Pankajan Chanthirasegaran, Razvan Ranca, Miltiadis Allamanis, Mirella Lapata, Charles Sutton, 2016.
- Summarizing Source Code using a Neural Attention Model - Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer, 2016.
Clone Detection
- DéjàVu: a map of code duplicates on GitHub - Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, Jan Vitek, 2017.
- Some from Here, Some from There: Cross-project Code Reuse in GitHub - Mohammad Gharehyazie, Baishakhi Ray, Vladimir Filkov, 2017.
- Deep Learning Code Fragments for Code Clone Detection - Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk, 2016.
- A study of repetitiveness of code changes in software evolution - HA Nguyen, AT Nguyen, TT Nguyen, TN Nguyen, H Rajan, 2013.
Differentiable Interpreters
- Improving the Universality and Learnability of Neural Programmer-Interpreters with Combinator Abstraction - Da Xiao, Jo-Yu Liao, Xingyuan Yuan, 2018.
- Differentiable Programs with Neural Libraries - Alexander L. Gaunt, Marc Brockschmidt, Nate Kushman, Daniel Tarlow, 2017.
- Differentiable Functional Program Interpreters - John K. Feser, Marc Brockschmidt, Alexander L. Gaunt, Daniel Tarlow, 2017.
- Programming with a Differentiable Forth Interpreter - Bošnjak, Matko, Tim Rocktäschel, Jason Naradowsky, and Sebastian Riedel, 2017.
- Neural Functional Programming - Feser, John K., Marc Brockschmidt, Alexander L. Gaunt, and Daniel Tarlow, 2017.
- TerpreT: A Probabilistic Programming Language for Program Induction - Gaunt, Alexander L., Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow, 2016.
Related research
Binary Data Modeling
- Clustering Binary Data with Bernoulli Mixture Models - Neal S. Grantham
- A Family of Blockwise One-Factor Distributions for Modelling High-Dimensional Binary Data - Matthieu Marbac and Mohammed Sedki
- BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data - Panagiotis Papastamoulis and Magnus Rattray
Soft Clustering Using T-mixture Models
- Robust mixture modelling using the t distribution - D. PEEL and G. J. MCLACHLAN
- Robust mixture modeling using the skew t distribution - Tsung I. Lin, Jack C. Lee and Wan J. Hsieh
Posts
- Sequence Intent Classification Using Hierarchical Attention Networks
- Syntax-Directed Variational Autoencoder for Structured Data
- Weighted MinHash on GPU helps to find duplicate GitHub repositories.
- Source Code Identifier Embeddings
- Using recurrent neural networks to predict next tokens in the java solutions
- The half-life of code & the ship of Theseus
- The eigenvector of "Why we moved from language X to language Y"
- Analyzing Github, How Developers Change Programming Languages Over Time
Talks
- Topic Modeling of GitHub Repositories
- Similarity of GitHub Repositories by Source Code Identifiers
- Using deep RNN to model source code
- Source code abstracts classification using CNN (1)
- Source code abstracts classification using CNN (2)
- Source code abstracts classification using CNN (3)
- Embedding the GitHub contribution graph
- Measuring code sentiment in a Git repository
Software
Machine Learning
- Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
- sourced.ml - Abstracts feature extraction from source code syntax trees and working with ML models.
- vecino - Finds similar Git repositories.
- apollo - Source code deduplication as scale, research.
- gemini - Source code deduplication as scale, production.
- enry - Insanely fast file based programming language detector.
- hercules - Git repository mining framework with batteries on top of go-git.
- Code Neuron - Recurrent neural network to detect code blocks in natural language text.
- Naturalize - Language agnostic framework for learning coding conventions from a codebase and then expoiting this information for suggesting better identifier names and formatting changes in the code.
- Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
- Summarizing Source Code using a Neural Attention Model - CODE-NN, uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Torch over C#/SQL
- Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
- Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
- TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
- JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
- Clone Digger - clone detection for Python and Java.
Utilities
- go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
- bblfsh - Self-hosted server for source code parsing.
- engine - Scalable and distributed data retrieval pipeline for source code.
- minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates.
- kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
- wmd-relax - Python package which finds nearest neighbors at Word Mover's Distance.
Datasets
- Public Git Archive - 3 TB of Git repositories from GitHub.
- GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016).
- 452M commits on GitHub - ≈ 452M commits' metadata from 16M repositories on GitHub (October 2016).
- GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016).
- from language X to Y - Cache file Erik Bernhardsson collected for his awesome blog post.
- GitHub word2vec 120k - Sequences of identifiers extracted from top starred 120,000 GitHub repos.
- GitHub Source Code Names - Names in source code extracted from 13M GitHub repositories, not people.
- GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other.
- GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories.
- GitHub Java Corpus - GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
- 150k Python Dataset - Dataset consisting of 150'000 Python ASTs.
- 150k JavaScript Dataset - Dataset consisting of 150'000 JavaScript files and their parsed ASTs.
- card2code - This dataset contains the language to code datasets described in the paper Latent Predictor Networks for Code Generation.
- NL2Bash - This dataset contains a set of ~10,000 bash one-liners collected from websites such as StackOverflow and their English descriptions written by Bash programmers, as described in the paper.
Credits
- A lot of references and articles were taken from mast-group
Contributions
See CONTRIBUTING.md.