A curated list of awesome machine learning frameworks and algorithms that work on top of source code. Inspired by Awesome Machine Learning.
- Neural Program Synthesis with Priority Queue Training - Daniel A. Abolafia, Mohammad Norouzi, Quoc V. Le.
- Code Completion with Neural Attention and Pointer Networks - Jian Li, Yue Wang, Irwin King, Michael R. Lyu.
- Learning to Represent Programs with Graphs - Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi.
- Semantic Code Repair using Neuro-Symbolic Transformation Networks - Jacob Devlin, Jonathan Uesato, Rishabh Singh, Pushmeet Kohli.
- Neural Program Meta-Induction - Jacob Devlin, Rudy Bunel, Rishabh Singh, Matthew Hausknecht, Pushmeet Kohli.
- Code Attention: Translating Code to Comments by Exploiting Domain Features - Wenhao Zheng, Hong-Yu Zhou, Ming Li, Jianxin Wu.
- A Survey of Machine Learning for Big Code and Naturalness - Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, Charles Sutton.
- Glass-Box Program Synthesis: A Machine Learning Approach - Konstantina Christakopoulou, Adam Tauman Kalai.
- Automatically Generating Commit Messages from Diffs using Neural Machine Translation - Siyuan Jiang, Ameer Armaly, Collin McMillan.
- A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation - Antonio Valerio Miceli Barone, Rico Sennrich.
- SmartPaste: Learning to Adapt Source Code - Miltiadis Allamanis, Marc Brockschmidt.
- Topic modeling of public repositories at scale using names in source code
- A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes - Pablo Loyola, Edison Marrese-Taylor, Yutaka Matsuo.
- RobustFill: Neural Program Learning under Noisy I/O - Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, Pushmeet Kohli.
- Neural Programming by Example - Chengxun Shu, Hongyu Zhang.
- Parameter-Free Probabilistic API Mining across GitHub
- A Subsequence Interleaving Model for Sequential Pattern Mining
- Deep API Learning - Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim.
- A Convolutional Attention Network for Extreme Summarization of Source Code
- Tailored Mutants Fit Bugs Better
- A deep language model for software code - Hoa Khanh Dam, Truyen Tran, Trang Pham.
- TASSAL: Autofolding for Source Code Summarization
- Suggesting Accurate Method and Class Names
- Mining idioms from source code
- Mining Source Code Repositories at Massive Scale using Language Modeling
- Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code
- Latent Predictor Networks for Code Generation - Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom.
- Code Completion with Statistical Language Models - Veselin Raychev, Martin Vechev, Eran Yahav.
- Using recurrent neural networks to predict next tokens in the java solutions - Alex Skidanov, Illia Polosukhin.
- Learning Python Code Suggestion with a Sparse Pointer Network - Avishkar Bhoopchand, Tim Rocktäschel, Earl Barr, Sebastian Riedel.
- Learning Efficient Algorithms with Hierarchical Attentive Memory - Marcin Andrychowicz, Karol Kurach.
- DeepCoder: Learning to Write Programs - Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, Daniel Tarlow.
- Programming with a Differentiable Forth Interpreter - Matko Bošnjak, Tim Rocktäschel, Jason Naradowsky, Sebastian Riedel.
- Learning to Superoptimize Programs - Workshop Version - Rudy Bunel, Alban Desmaison, M. Pawan Kumar, Philip H. S. Torr, Pushmeet Kohli.
- Meta-Interpretive Learning of Efficient Logic Programs - Andrew Cropper, Stephen H. Muggleton.
- Learning Operations on a Stack with Neural Turing Machines - Tristan Deleu, Joseph Dureau.
- Neural Functional Programming - John K. Feser, Marc Brockschmidt, Alexander L. Gaunt, Daniel Tarlow.
- TerpreT: A Probabilistic Programming Language for Program Induction - Alexander L. Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, Daniel Tarlow.
- Neural Turing Machines - Alex Graves, Greg Wayne, Ivo Danihelka.
- Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision (Short Version) - Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, Ni Lao.
- Probabilistic Neural Programs - Kenton W. Murray, Jayant Krishnamurthy.
- Neural Programmer: Inducing Latent Programs with Gradient Descent - Arvind Neelakantan, Quoc V. Le, Ilya Sutskever.
- Divide and Conquer with Neural Networks - Alex Nowak, Joan Bruna.
- Neural Programmer-Interpreters - Scott Reed, Nando de Freitas.
- Programs as Black-Box Explanations - Sameer Singh, Marco Tulio Ribeiro, Carlos Guestrin.
- A Differentiable Approach to Inductive Logic Programming - Fan Yang, Zhilin Yang, William W. Cohen.
- From Machine Learning to Machine Reasoning - Leon Bottou.
- Learning Latent Multiscale Structure Using Recurrent Neural Networks - Junyoung Chung, Sungjin Ahn, Yoshua Bengio.
- Lifelong Perceptual Programming By Example - Alexander L. Gaunt, Marc Brockschmidt, Nate Kushman, Daniel Tarlow.
- Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets - Armand Joulin, Tomas Mikolov.
- Neural GPUs Learn Algorithms - Łukasz Kaiser, Ilya Sutskever.
- API usage pattern recommendation for software development - Haoran Niu, Iman Keivanloo, Ying Zou.
- Summarizing Source Code using a Neural Attention Model - University of Washington CSE, Seattle, WA, USA.
- Program Synthesis from Natural Language Using Recurrent Neural Networks - University of Washington CSE, Seattle, WA, USA.
- Exploring API Embedding for API Usages and Applications - Nguyen, Nguyen, Phan and Nguyen.
- Neural Nets Can Learn Function Type Signatures From Binaries - Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang.
- Deep Learning Code Fragments for Code Clone Detection - Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk.
- Automated Identification of Security Issues from Commit Messages and Bug Reports [PDF] - Yaqin Zhou and Asankhaya Sharma.
- Neural Sketch Learning for Conditional Program Generation - Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, Chris Jermaine.
- Weighted MinHash on GPU helps to find duplicate GitHub repositories.
- Source Code Identifier Embeddings
- The half-life of code & the ship of Theseus
- The eigenvector of "Why we moved from language X to language Y"
- Analyzing Github, How Developers Change Programming Languages Over Time
- Topic Modeling of GitHub Repositories
- Similarity of GitHub Repositories by Source Code Identifiers
- Using deep RNN to model source code
- Source code abstracts classification using CNN (1)
- Source code abstracts classification using CNN (2)
- Source code abstracts classification using CNN (3)
- Embedding the GitHub contribution graph
- Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
- sourced.ml - Abstracts feature extraction from source code syntax trees and the handling of ML models built on them.
- vecino - Finds similar Git repositories.
- apollo - Source code deduplication at scale, research.
- gemini - Source code deduplication at scale, production.
- enry - Insanely fast file-based programming language detector.
- Naturalize - Language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
- Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
- Summarizing Source Code using a Neural Attention Model - CODE-NN, which uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from Stack Overflow. Implemented in Torch, applied to C# and SQL data.
- Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
- Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
- TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
- JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
- go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
- hercules - Git repository mining framework with batteries on top of go-git.
- bblfsh - Self-hosted server for source code parsing.
- engine - Scalable and distributed data retrieval pipeline for source code.
- minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates; the underlying weighted MinHash scheme is sketched after the datasets below.
- kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
- wmd-relax - Python package which finds nearest neighbors under Word Mover's Distance; the relaxed lower bound behind its name is sketched after the datasets below.
- GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016).
- 452M commits on GitHub - Metadata of ≈452M commits from 16M repositories on GitHub (October 2016).
- GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016).
- from language X to Y - Cache file Erik Bernhardsson collected for his awesome blog post.
- GitHub word2vec 120k - Sequences of identifiers extracted from the 120,000 most-starred GitHub repositories.
- GitHub Source Code Names - Identifier names (not people's names) extracted from source code in 13M GitHub repositories.
- GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other.
- GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories.
- GitHub Java Corpus - A set of Java projects collected from GitHub that the authors have used in a number of their publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
- 150k Python Dataset - Dataset consisting of 150,000 Python ASTs.
- 150k JavaScript Dataset - Dataset consisting of 150,000 JavaScript files and their parsed ASTs.
- card2code - This dataset contains the language-to-code datasets described in the paper Latent Predictor Networks for Code Generation.
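
Weighted MinHash, which minhashcuda and the duplicate-repository work above rely on, fits in a short NumPy sketch. This is a minimal illustration of Ioffe's consistent weighted sampling, not the minhashcuda API; the function names, signature length, seed, and bag-of-identifiers input representation are assumptions made for the example.

```python
import numpy as np

def weighted_minhash(weights, num_hashes=128, seed=42):
    """Consistent Weighted Sampling (Ioffe, 2010) for one weighted feature vector.

    weights: non-negative feature weights, e.g. identifier counts of a repository.
    Returns (num_hashes, 2) pairs (feature index, t); the fraction of equal pairs
    between two signatures estimates their weighted Jaccard similarity.
    """
    weights = np.asarray(weights, dtype=float)
    rng = np.random.RandomState(seed)  # same seed => comparable signatures
    r = rng.gamma(2, 1, size=(num_hashes, weights.size))
    c = rng.gamma(2, 1, size=(num_hashes, weights.size))
    beta = rng.uniform(0, 1, size=(num_hashes, weights.size))

    nz = weights > 0                                  # only nonzero features matter
    log_w = np.log(weights[nz])
    t = np.floor(log_w / r[:, nz] + beta[:, nz])
    ln_y = r[:, nz] * (t - beta[:, nz])
    ln_a = np.log(c[:, nz]) - ln_y - r[:, nz]

    k_star = ln_a.argmin(axis=1)                      # winning feature per hash
    idx = np.flatnonzero(nz)[k_star]                  # map back to original indices
    t_star = t[np.arange(num_hashes), k_star]
    return np.stack([idx, t_star.astype(np.int64)], axis=1)

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching (index, t) pairs, for signatures built with the same
    seed and num_hashes, approximates the weighted Jaccard similarity."""
    return float(np.mean(np.all(sig_a == sig_b, axis=1)))
```

Two repositories represented as weighted bags of identifiers and hashed with the same seed can then be compared with `estimated_similarity`; pairs scoring above a chosen threshold are duplicate candidates, and the GPU implementation above applies the same idea at the scale of millions of repositories.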
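In the same spirit, the relaxation behind wmd-relax can be sketched briefly. This follows the relaxed Word Mover's Distance lower bound from Kusner et al.'s "From Word Embeddings To Document Distances" and is not the wmd-relax API; the weight vectors and embedding matrices are hypothetical inputs.

```python
import numpy as np

def relaxed_wmd(weights_a, emb_a, weights_b, emb_b):
    """Relaxed Word Mover's Distance lower bound between two documents.

    weights_*: non-negative word weights of each document (L1-normalized here).
    emb_*: word embedding matrix, one row per word of the document.
    Dropping one marginal constraint of the transport problem lets every word
    ship all of its mass to the closest word of the other document; the max of
    the two directions is a cheap lower bound on the exact WMD.
    """
    wa = np.asarray(weights_a, dtype=float)
    wb = np.asarray(weights_b, dtype=float)
    wa, wb = wa / wa.sum(), wb / wb.sum()
    emb_a, emb_b = np.asarray(emb_a, dtype=float), np.asarray(emb_b, dtype=float)

    # Pairwise Euclidean distances between the two documents' word embeddings.
    cost = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=-1)

    a_to_b = np.dot(wa, cost.min(axis=1))  # each word of A moves to its nearest word of B
    b_to_a = np.dot(wb, cost.min(axis=0))  # and vice versa
    return max(a_to_b, b_to_a)
```

Because the bound only needs a nearest-neighbor lookup per word instead of solving a full transportation problem, it can cheaply rank or prune candidate neighbors before any exact distance is computed, which is, to a first approximation, the trick wmd-relax makes fast.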
A lot of references and articles were taken from mast-group.
See CONTRIBUTING.md.