Awesome Machine Learning On Source Code

A curated list of awesome machine learning frameworks and algorithms that work on top of source code. Inspired by Awesome Machine Learning.

If you want to contribute to this list (please do), send a pull request or contact source{d} @srcd_.

Also, a listed repository should be deprecated if:

The repository's owner explicitly says that "this library is not maintained".
It has not received a commit for a long time (2-3 years).

Digests
Articles
- Machine learning
Frameworks
- Machine Learning
- Utilities
Datasets
- Datasets
Credits

Digests

Learning from "Big Code"
A Survey of Machine Learning for Big Code and Naturalness

Articles

Machine learning articles about processing source code

Topic modeling of public repositories at scale using names in source code
Topic Modeling of GitHub Repositories
Similarity of GitHub Repositories by Source Code Identifiers
Using deep RNN to model source code
Source code abstracts classification using CNN (1)
Source code abstracts classification using CNN (2)
Source code abstracts classification using CNN (3)
Embedding the GitHub contribution graph
Weighted MinHash on GPU helps to find duplicate GitHub repositories.
Parameter-Free Probabilistic API Mining across GitHub
A Subsequence Interleaving Model for Sequential Pattern Mining
A Convolutional Attention Network for Extreme Summarization of Source Code
Tailored Mutants Fit Bugs Better
TASSAL: Autofolding for Source Code Summarization
Suggesting Accurate Method and Class Names
Mining idioms from source code
Mining Source Code Repositories at Massive Scale using Language Modeling
Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code
Latent Predictor Networks for Code Generation - Address the problem of generating programming code from a mixed natural language and structured specification. Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom
Code Completion with Statistical Language Models - Veselin Raychev, Martin Vechev, Eran Yahav
Using recurrent neural networks to predict next tokens in the Java solutions - Alex Skidanov, Illia Polosukhin
Learning Python Code Suggestion with a Sparse Pointer Network - Avishkar Bhoopchand, Tim Rocktäschel, Earl Barr, Sebastian Riedel
Learning Efficient Algorithms with Hierarchical Attentive Memory - Andrychowicz, Marcin, and Karol Kurach
DeepCoder: Learning to Write Programs - Balog, Matej, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow
Programming with a Differentiable Forth Interpreter - Bošnjak, Matko, Tim Rocktäschel, Jason Naradowsky, and Sebastian Riedel
Learning to Superoptimize Programs - Workshop Version - Bunel, Rudy, Alban Desmaison, M. Pawan Kumar, Philip H. S. Torr, and Pushmeet Kohli
Meta-Interpretive Learning of Efficient Logic Programs - Cropper, Andrew, and Stephen H. Muggleton
Learning Operations on a Stack with Neural Turing Machines - Deleu, Tristan, and Joseph Dureau
Neural Functional Programming - Feser, John K., Marc Brockschmidt, Alexander L. Gaunt, and Daniel Tarlow
TerpreT: A Probabilistic Programming Language for Program Induction - Gaunt, Alexander L., Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow
Neural Turing Machines - Graves, Alex, Greg Wayne, and Ivo Danihelka
Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision (Short Version) - Liang, Chen, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao
Probabilistic Neural Programs - Murray, Kenton W., and Jayant Krishnamurthy
Neural Programmer: Inducing Latent Programs with Gradient Descent - Neelakantan, Arvind, Quoc V. Le, and Ilya Sutskever
Divide and Conquer with Neural Networks - Nowak, Alex, and Joan Bruna
Neural Programmer-Interpreters - Reed, Scott, and Nando de Freitas
Programs as Black-Box Explanations - Singh, Sameer, Marco Tulio Ribeiro, and Carlos Guestrin
A Differentiable Approach to Inductive Logic Programming - Yang, Fan, Zhilin Yang, and William W. Cohen
From Machine Learning to Machine Reasoning - Bottou, Leon
Learning Latent Multiscale Structure Using Recurrent Neural Networks - Chung, Junyoung, Sungjin Ahn, and Yoshua Bengio
Lifelong Perceptual Programming By Example - Gaunt, Alexander L., Marc Brockschmidt, Nate Kushman, and Daniel Tarlow
Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets - Joulin, Armand, and Tomas Mikolov
Neural GPUs Learn Algorithms - Kaiser, Łukasz, and Ilya Sutskever
API usage pattern recommendation for software development - Haoran Niu, Iman Keivanloo, Ying Zou
Summarizing Source Code using a Neural Attention Model - University of Washington CSE, Seattle, WA, USA
Program Synthesis from Natural Language Using Recurrent Neural Networks - University of Washington CSE, Seattle, WA, USA
Exploring API Embedding for API Usages and Applications - Nguyen, Nguyen, Phan and Nguyen
Neural Nets Can Learn Function Type Signatures From Binaries - Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang
Deep Learning Code Fragments for Code Clone Detection - Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. Published at ASE'16.
Automated Identification of Security Issues from Commit Messages and Bug Reports - Yaqin Zhou and Asankhaya Sharma. Published at ESEC/FSE 2017.

Frameworks

Machine Learning frameworks/libraries

Differentiable Neural Computer (DNC) - A TensorFlow implementation of the Differentiable Neural Computer.
sourced.ml - Abstracts feature extraction from source code syntax trees and working with models
vecino - Discovering similar Git repositories
enry - Insanely fast file-based programming language detector.
Naturalize - Naturalize is a language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
Extreme Source Code Summarization - A convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
Summarizing Source Code using a Neural Attention Model - CODE-NN uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Torch over C#/SQL
Probabilistic API Miner - PAM is a near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
Interesting Sequence Miner - ISM is a novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
TASSAL - TASSAL is a tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.

Frameworks for preprocessing source code, etc.

go-git - A highly extensible Git implementation in pure Go.
bblfsh - A self-hosted server for source code parsing
engine - source{d}, a scalable and distributed data retrieval pipeline for source code
minhashcuda - source{d}, to efficiently remove duplicates of repositories on the nBOW model
kmcuda - source{d}, to cluster and to search for nearest neighbors in dense space
wmd-relax - source{d}, to find nearest neighbors by Word Mover's Distance, e.g. to find the nearest repositories
swivel-spark-prep - Distributed equivalent of prep.py and fastprep from Swivel using Apache Spark.
hercules - Calculates the lines burnout stats in a Git repository
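
For readers new to the duplicate-detection entries above, the following is a minimal sketch of the general MinHash idea: two repositories, each reduced to a set of identifiers, are compared by how many positions of their MinHash signatures agree, which approximates their Jaccard similarity. This is plain, unweighted MinHash on the CPU, not the Weighted MinHash that minhashcuda runs on the GPU, and it does not use that library's API. The function names, the number of hashes, and the sample identifiers are all assumptions made for illustration.

```python
import hashlib


def _hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token, salted with a seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=128):
    """Return the MinHash signature (one minimum per seeded hash) of a token set."""
    unique = set(tokens)
    if not unique:
        raise ValueError("cannot sketch an empty token set")
    return [min(_hash(t, seed) for t in unique) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical identifier sets standing in for two repositories.
    repo_a = {"parse", "token", "lexer", "ast", "visitor"}
    repo_b = {"parse", "token", "lexer", "ast", "printer"}
    sig_a = minhash_signature(repo_a)
    sig_b = minhash_signature(repo_b)
    print(f"estimated similarity: {estimate_jaccard(sig_a, sig_b):.2f}")
```

Pairs whose estimate exceeds a chosen threshold would be treated as duplicate candidates; the threshold itself is a tuning decision and is not specified by this list.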

Source code datasets

GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016)
452M commits on GitHub - ≈ 452M commits' metadata from 16M repositories on GitHub (October 2016)
GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016)
from language X to Y - The cache file Erik Bernhardsson collected for his awesome blog post
GitHub word2vec 120k - Sequences of identifiers extracted from top starred 120,000 GitHub repos
GitHub Source Code Names - Names in source code extracted from 13M GitHub repositories, not people!
GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other
GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories
GitHub Java Corpus - The GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
150k Python Dataset - Dataset consisting of 150,000 Python ASTs
150k JavaScript Dataset - Dataset consisting of 150,000 JavaScript files and their parsed ASTs
card2code - This dataset contains the language to code datasets described in our paper: Latent Predictor Networks for Code Generation

Credits

Many of the references and articles were taken from mast-group.
