Interesting links & research papers related to Machine Learning applied to source code

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

Awesome Machine Learning On Source Code

A curated list of awesome machine learning frameworks and algorithms that work on top of source code. Inspired by Awesome Machine Learning.

Contents

Digests

Conferences

Papers

Program Synthesis and Induction

Source Code Analysis and Language modeling

Neural Network Architectures and Algorithms

Program Translation

Code Suggestion and Completion

Program Repair and Bug Detection

APIs and Code Mining

Code Optimization

Topic Modeling

Code Summarization

Clone Detection

Differentiable Interpreters

Binary Data Modelling

Posts

Talks

Software

Machine Learning

  • Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
  • sourced.ml - Abstracts feature extraction from source code syntax trees and working with ML models.
  • vecino - Finds similar Git repositories.
  • apollo - Source code deduplication at scale, research.
  • gemini - Source code deduplication at scale, production.
  • enry - Insanely fast file-based programming language detector.
  • Naturalize - Language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
  • Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
  • Summarizing Source Code using a Neural Attention Model - CODE-NN, uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Implemented in Torch, applied to C# and SQL.
  • Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
  • Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
  • TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
  • JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.

Utilities

  • go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
  • hercules - Git repository mining framework with batteries on top of go-git.
  • bblfsh - Self-hosted server for source code parsing.
  • engine - Scalable and distributed data retrieval pipeline for source code.
  • minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates.
  • kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
  • wmd-relax - Python package which finds nearest neighbors at Word Mover's Distance.

Datasets

Credits

  • A lot of references and articles were taken from mast-group

Contributions

See CONTRIBUTING.md.

License

License: CC BY-SA 4.0