A curated list of awesome machine learning frameworks and algorithms that work on top of source code. Inspired by Awesome Machine Learning.
If you want to contribute to this list (please do), send a pull request or contact source{d} @srcd_.
Also, a listed repository should be deprecated if:
- The repository's owner explicitly says that "this library is not maintained".
- It has had no commits for a long time (2~3 years).
- Learning from "Big Code"
- A Survey of Machine Learning for Big Code and Naturalness
- Topic modeling of public repositories at scale using names in source code
- Topic Modeling of GitHub Repositories
- Similarity of GitHub Repositories by Source Code Identifiers
- Using deep RNN to model source code
- Source code abstracts classification using CNN (1)
- Source code abstracts classification using CNN (2)
- Source code abstracts classification using CNN (3)
- Embedding the GitHub contribution graph
- Weighted MinHash on GPU helps to find duplicate GitHub repositories.
- Parameter-Free Probabilistic API Mining across GitHub
- A Subsequence Interleaving Model for Sequential Pattern Mining
- A Convolutional Attention Network for Extreme Summarization of Source Code
- Tailored Mutants Fit Bugs Better
- TASSAL: Autofolding for Source Code Summarization
- Suggesting Accurate Method and Class Names
- Mining idioms from source code
- Mining Source Code Repositories at Massive Scale using Language Modeling
- Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code
- Latent Predictor Networks for Code Generation - Addresses the problem of generating programming code from a mixed natural language and structured specification. Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom
- Code Completion with Statistical Language Models - Veselin Raychev, Martin Vechev, Eran Yahav
- Using recurrent neural networks to predict next tokens in the java solutions - Alex Skidanov, Illia Polosukhin
- Learning Python Code Suggestion with a Sparse Pointer Network - Avishkar Bhoopchand, Tim Rocktäschel, Earl Barr, Sebastian Riedel
- Learning Efficient Algorithms with Hierarchical Attentive Memory - Andrychowicz, Marcin, and Karol Kurach
- DeepCoder: Learning to Write Programs - Balog, Matej, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow
- Programming with a Differentiable Forth Interpreter - Bošnjak, Matko, Tim Rocktäschel, Jason Naradowsky, and Sebastian Riedel
- Learning to Superoptimize Programs - Workshop Version - Bunel, Rudy, Alban Desmaison, M. Pawan Kumar, Philip H. S. Torr, and Pushmeet Kohli
- Meta-Interpretive Learning of Efficient Logic Programs - Cropper, Andrew, and Stephen H. Muggleton
- Learning Operations on a Stack with Neural Turing Machines - Deleu, Tristan, and Joseph Dureau
- Neural Functional Programming - Feser, John K., Marc Brockschmidt, Alexander L. Gaunt, and Daniel Tarlow
- TerpreT: A Probabilistic Programming Language for Program Induction - Gaunt, Alexander L., Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow
- Neural Turing Machines - Graves, Alex, Greg Wayne, and Ivo Danihelka
- Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision (Short Version) - Liang, Chen, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao
- Probabilistic Neural Programs - Murray, Kenton W., and Jayant Krishnamurthy
- Neural Programmer: Inducing Latent Programs with Gradient Descent - Neelakantan, Arvind, Quoc V. Le, and Ilya Sutskever
- Divide and Conquer with Neural Networks - Nowak, Alex, and Joan Bruna
- Neural Programmer-Interpreters - Reed, Scott, and Nando de Freitas
- Programs as Black-Box Explanations - Singh, Sameer, Marco Tulio Ribeiro, and Carlos Guestrin
- A Differentiable Approach to Inductive Logic Programming - Yang, Fan, Zhilin Yang, and William W. Cohen
- From Machine Learning to Machine Reasoning - Bottou, Leon
- Learning Latent Multiscale Structure Using Recurrent Neural Networks - Chung, Junyoung, Sungjin Ahn, and Yoshua Bengio
- Lifelong Perceptual Programming By Example - Gaunt, Alexander L., Marc Brockschmidt, Nate Kushman, and Daniel Tarlow
- Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets - Joulin, Armand, and Tomas Mikolov
- Neural GPUs Learn Algorithms - Kaiser, Łukasz, and Ilya Sutskever
- API usage pattern recommendation for software development - Haoran Niu, Iman Keivanloo, Ying Zou
- Summarizing Source Code using a Neural Attention Model - University of Washington CSE, Seattle, WA, USA
- Program Synthesis from Natural Language Using Recurrent Neural Networks - University of Washington CSE, Seattle, WA, USA
- Exploring API Embedding for API Usages and Applications - Nguyen, Nguyen, Phan and Nguyen
- Neural Nets Can Learn Function Type Signatures From Binaries - Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang
- Deep Learning Code Fragments for Code Clone Detection - Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. Published at ASE'16.
- Automated Identification of Security Issues from Commit Messages and Bug Reports - Yaqin Zhou and Asankhaya Sharma. Published at ESEC/FSE 2017.
- Differentiable Neural Computer (DNC) - A TensorFlow implementation of the Differentiable Neural Computer.
- sourced.ml - Abstracts feature extraction from source code syntax trees and working with models
- vecino - Discovering similar Git repositories
- enry - Insanely fast file-based programming language detector.
- Naturalize - Naturalize is a language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
- Extreme Source Code Summarization - A convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
- Summarizing Source Code using a Neural Attention Model - CODE-NN uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Written in Torch; operates on C# and SQL.
- Probabilistic API Miner - PAM is a near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
- Interesting Sequence Miner - ISM is a novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
- TASSAL - TASSAL is a tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
- JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
- go-git - A highly extensible Git implementation in pure Go.
- bblfsh - A self-hosted server for source code parsing
- engine - source{d}, a scalable and distributed data retrieval pipeline for source code
- minhashcuda - Used by source{d} to efficiently remove duplicate repositories on the nBOW model
- kmcuda - Used by source{d} to cluster and to search for nearest neighbors in dense space
- wmd-relax - Used by source{d} to find nearest neighbors by Word Mover's Distance, i.e. to find the nearest repositories
- swivel-spark-prep - Distributed equivalent of prep.py and fastprep from Swivel using Apache Spark.
- hercules - Calculates line burndown statistics for a Git repository
- GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016)
- 452M commits on GitHub - ≈ 452M commits' metadata from 16M repositories on GitHub (October 2016)
- GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016)
- from language X to Y - The cache file Erik Bernhardsson collected for his awesome blog post
- GitHub word2vec 120k - Sequences of identifiers extracted from top starred 120,000 GitHub repos
- GitHub Source Code Names - Names in source code extracted from 13M GitHub repositories, not people!
- GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other
- GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories
- GitHub Java Corpus - The GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
- 150k Python Dataset - Dataset consisting of 150'000 Python ASTs
- 150k JavaScript Dataset - Dataset consisting of 150'000 JavaScript files and their parsed ASTs
- card2code - This dataset contains the language to code datasets described in our paper: Latent Predictor Networks for Code Generation
- A lot of references and articles were taken from mast-group