A curated list of awesome machine learning frameworks and algorithms that work on top of source code. Inspired by Awesome Machine Learning.
- Neural Program Synthesis with Priority Queue Training - Daniel A. Abolafia, Mohammad Norouzi, Quoc V. Le.
- Code Completion with Neural Attention and Pointer Networks - Jian Li, Yue Wang, Irwin King, Michael R. Lyu.
- Learning to Represent Programs with Graphs - Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi.
- Semantic Code Repair using Neuro-Symbolic Transformation Networks - Jacob Devlin, Jonathan Uesato, Rishabh Singh, Pushmeet Kohli.
- Neural Program Meta-Induction - Jacob Devlin, Rudy Bunel, Rishabh Singh, Matthew Hausknecht, Pushmeet Kohli.
- Code Attention: Translating Code to Comments by Exploiting Domain Features - Wenhao Zheng, Hong-Yu Zhou, Ming Li, Jianxin Wu.
- A Survey of Machine Learning for Big Code and Naturalness - Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, Charles Sutton.
- Glass-Box Program Synthesis: A Machine Learning Approach - Konstantina Christakopoulou, Adam Tauman Kalai.
- Automatically Generating Commit Messages from Diffs using Neural Machine Translation - Siyuan Jiang, Ameer Armaly, Collin McMillan.
- A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation - Antonio Valerio Miceli Barone, Rico Sennrich.
- SmartPaste: Learning to Adapt Source Code - Miltiadis Allamanis, Marc Brockschmidt.
- Topic modeling of public repositories at scale using names in source code
- A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes - Pablo Loyola, Edison Marrese-Taylor, Yutaka Matsuo.
- RobustFill: Neural Program Learning under Noisy I/O - Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, Pushmeet Kohli.
- Neural Programming by Example - Chengxun Shu, Hongyu Zhang.
- Parameter-Free Probabilistic API Mining across GitHub
- A Subsequence Interleaving Model for Sequential Pattern Mining
- Deep API Learning - Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim.
- A Convolutional Attention Network for Extreme Summarization of Source Code
- Tailored Mutants Fit Bugs Better
- A deep language model for software code - Hoa Khanh Dam, Truyen Tran, Trang Pham.
- TASSAL: Autofolding for Source Code Summarization
- Suggesting Accurate Method and Class Names
- Mining idioms from source code
- Mining Source Code Repositories at Massive Scale using Language Modeling
- Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code
- Latent Predictor Networks for Code Generation - Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom.
- Code Completion with Statistical Language Models - Veselin Raychev, Martin Vechev, Eran Yahav.
- Using recurrent neural networks to predict next tokens in the java solutions - Alex Skidanov, Illia Polosukhin.
- Learning Python Code Suggestion with a Sparse Pointer Network - Avishkar Bhoopchand, Tim Rocktäschel, Earl Barr, Sebastian Riedel.
- Learning Efficient Algorithms with Hierarchical Attentive Memory - Marcin Andrychowicz, Karol Kurach.
- DeepCoder: Learning to Write Programs - Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, Daniel Tarlow.
- Programming with a Differentiable Forth Interpreter - Matko Bošnjak, Tim Rocktäschel, Jason Naradowsky, Sebastian Riedel.
- Learning to Superoptimize Programs - Workshop Version - Rudy Bunel, Alban Desmaison, M. Pawan Kumar, Philip H. S. Torr, Pushmeet Kohli.
- Meta-Interpretive Learning of Efficient Logic Programs - Andrew Cropper, Stephen H. Muggleton.
- Learning Operations on a Stack with Neural Turing Machines - Tristan Deleu, Joseph Dureau.
- Neural Functional Programming - John K. Feser, Marc Brockschmidt, Alexander L. Gaunt, Daniel Tarlow.
- TerpreT: A Probabilistic Programming Language for Program Induction - Alexander L. Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, Daniel Tarlow.
- Neural Turing Machines - Alex Graves, Greg Wayne, Ivo Danihelka.
- Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision (Short Version) - Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, Ni Lao.
- Probabilistic Neural Programs - Kenton W. Murray, Jayant Krishnamurthy.
- Neural Programmer: Inducing Latent Programs with Gradient Descent - Arvind Neelakantan, Quoc V. Le, Ilya Sutskever.
- Divide and Conquer with Neural Networks - Alex Nowak, Joan Bruna.
- Neural Programmer-Interpreters - Scott Reed, Nando de Freitas.
- Programs as Black-Box Explanations - Sameer Singh, Marco Tulio Ribeiro, Carlos Guestrin.
- A Differentiable Approach to Inductive Logic Programming - Fan Yang, Zhilin Yang, William W. Cohen.
- From Machine Learning to Machine Reasoning - Leon Bottou.
- Learning Latent Multiscale Structure Using Recurrent Neural Networks - Junyoung Chung, Sungjin Ahn, Yoshua Bengio.
- Lifelong Perceptual Programming By Example - Alexander L. Gaunt, Marc Brockschmidt, Nate Kushman, Daniel Tarlow.
- Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets - Armand Joulin, Tomas Mikolov.
- Neural GPUs Learn Algorithms - Łukasz Kaiser, Ilya Sutskever.
- API usage pattern recommendation for software development - Haoran Niu, Iman Keivanloo, Ying Zou.
- Summarizing Source Code using a Neural Attention Model - University of Washington CSE, Seattle, WA, USA.
- Program Synthesis from Natural Language Using Recurrent Neural Networks - University of Washington CSE, Seattle, WA, USA.
- Exploring API Embedding for API Usages and Applications - Nguyen, Nguyen, Phan and Nguyen.
- Neural Nets Can Learn Function Type Signatures From Binaries - Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang.
- Deep Learning Code Fragments for Code Clone Detection - Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk.
- Automated Identification of Security Issues from Commit Messages and Bug Reports [PDF] - Yaqin Zhou and Asankhaya Sharma.
- Neural Sketch Learning for Conditional Program Generation - Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, Chris Jermaine.
- Weighted MinHash on GPU helps to find duplicate GitHub repositories.
- Source Code Identifier Embeddings
- The half-life of code & the ship of Theseus
- The eigenvector of "Why we moved from language X to language Y"
- Analyzing Github, How Developers Change Programming Languages Over Time
- Topic Modeling of GitHub Repositories
- Similarity of GitHub Repositories by Source Code Identifiers
- Using deep RNN to model source code
- Source code abstracts classification using CNN (1)
- Source code abstracts classification using CNN (2)
- Source code abstracts classification using CNN (3)
- Embedding the GitHub contribution graph
- Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
- sourced.ml - Abstracts feature extraction from source code syntax trees and the handling of ML models built on them.
- vecino - Finds similar Git repositories.
- apollo - Source code deduplication at scale, research.
- gemini - Source code deduplication at scale, production.
- enry - Insanely fast file-based programming language detector.
- Naturalize - Language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
- Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
- Summarizing Source Code using a Neural Attention Model - CODE-NN, which uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from Stack Overflow. Implemented in Torch, applied to C# and SQL data.
- Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
- Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
- TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
- JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
- go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
- hercules - Git repository mining framework with batteries on top of go-git.
- bblfsh - Self-hosted server for source code parsing.
- engine - Scalable and distributed data retrieval pipeline for source code.
- minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates; the underlying weighted MinHash scheme is sketched after the datasets below.
- kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
- wmd-relax - Python package which finds nearest neighbors under Word Mover's Distance; the relaxed lower bound behind its name is sketched after the datasets below.
- GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016).
- 452M commits on GitHub - Metadata of ≈452M commits from 16M repositories on GitHub (October 2016).
- GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016).
- from language X to Y - Cache file Erik Bernhardsson collected for his awesome blog post.
- GitHub word2vec 120k - Sequences of identifiers extracted from the 120,000 most-starred GitHub repositories.
- GitHub Source Code Names - Identifier names (not people's names) extracted from source code in 13M GitHub repositories.
- GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other.
- GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories.
- GitHub Java Corpus - A set of Java projects collected from GitHub that the authors have used in a number of their publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
- 150k Python Dataset - Dataset consisting of 150,000 Python ASTs.
- 150k JavaScript Dataset - Dataset consisting of 150,000 JavaScript files and their parsed ASTs.
- card2code - This dataset contains the language-to-code datasets described in the paper Latent Predictor Networks for Code Generation.
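
Weighted MinHash, which minhashcuda and the duplicate-repository work above rely on, fits in a short NumPy sketch. This is a minimal illustration of Ioffe's consistent weighted sampling, not the minhashcuda API; the function names, signature length, seed, and bag-of-identifiers input representation are assumptions made for the example.

```python
import numpy as np

def weighted_minhash(weights, num_hashes=128, seed=42):
    """Consistent Weighted Sampling (Ioffe, 2010) for one weighted feature vector.

    weights: non-negative feature weights, e.g. identifier counts of a repository.
    Returns (num_hashes, 2) pairs (feature index, t); the fraction of equal pairs
    between two signatures estimates their weighted Jaccard similarity.
    """
    weights = np.asarray(weights, dtype=float)
    rng = np.random.RandomState(seed)  # same seed => comparable signatures
    r = rng.gamma(2, 1, size=(num_hashes, weights.size))
    c = rng.gamma(2, 1, size=(num_hashes, weights.size))
    beta = rng.uniform(0, 1, size=(num_hashes, weights.size))

    nz = weights > 0                                  # only nonzero features matter
    log_w = np.log(weights[nz])
    t = np.floor(log_w / r[:, nz] + beta[:, nz])
    ln_y = r[:, nz] * (t - beta[:, nz])
    ln_a = np.log(c[:, nz]) - ln_y - r[:, nz]

    k_star = ln_a.argmin(axis=1)                      # winning feature per hash
    idx = np.flatnonzero(nz)[k_star]                  # map back to original indices
    t_star = t[np.arange(num_hashes), k_star]
    return np.stack([idx, t_star.astype(np.int64)], axis=1)

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching (index, t) pairs, for signatures built with the same
    seed and num_hashes, approximates the weighted Jaccard similarity."""
    return float(np.mean(np.all(sig_a == sig_b, axis=1)))
```

Two repositories represented as weighted bags of identifiers and hashed with the same seed can then be compared with `estimated_similarity`; pairs scoring above a chosen threshold are duplicate candidates, and the GPU implementation above applies the same idea at the scale of millions of repositories.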
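In the same spirit, the relaxation behind wmd-relax can be sketched briefly. This follows the relaxed Word Mover's Distance lower bound from Kusner et al.'s "From Word Embeddings To Document Distances" and is not the wmd-relax API; the weight vectors and embedding matrices are hypothetical inputs.

```python
import numpy as np

def relaxed_wmd(weights_a, emb_a, weights_b, emb_b):
    """Relaxed Word Mover's Distance lower bound between two documents.

    weights_*: non-negative word weights of each document (L1-normalized here).
    emb_*: word embedding matrix, one row per word of the document.
    Dropping one marginal constraint of the transport problem lets every word
    ship all of its mass to the closest word of the other document; the max of
    the two directions is a cheap lower bound on the exact WMD.
    """
    wa = np.asarray(weights_a, dtype=float)
    wb = np.asarray(weights_b, dtype=float)
    wa, wb = wa / wa.sum(), wb / wb.sum()
    emb_a, emb_b = np.asarray(emb_a, dtype=float), np.asarray(emb_b, dtype=float)

    # Pairwise Euclidean distances between the two documents' word embeddings.
    cost = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=-1)

    a_to_b = np.dot(wa, cost.min(axis=1))  # each word of A moves to its nearest word of B
    b_to_a = np.dot(wb, cost.min(axis=0))  # and vice versa
    return max(a_to_b, b_to_a)
```

Because the bound only needs a nearest-neighbor lookup per word instead of solving a full transportation problem, it can cheaply rank or prune candidate neighbors before any exact distance is computed, which is, to a first approximation, the trick wmd-relax makes fast.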
A lot of references and articles were taken from mast-group.
See CONTRIBUTING.md.