/ml4code-dataset

A collection of datasets for machine learning for big code

MIT License

A Collection of Datasets for Big Code Analysis

A collection of datasets (and other resources) for big code analysis.

If you want to contribute to this list, please send a pull request.

Datasets

| Name | Description | Tag | Language | Link |
| --- | --- | --- | --- | --- |
| CodeSearchNet | Dataset and benchmarks for code retrieval using natural language | Code Retrieval, NLP | Multiple (Python) | link |
| PY150 | 150k Python programs and their corresponding abstract syntax trees, released with the OOPSLA'16 paper "Probabilistic Model for Code with Decision Trees" | General | Python | link |
| OJ-104 | Code from an online judge system, consisting of 104 classes of C programs, released with the AAAI'16 paper "Convolutional Neural Networks over Tree Structures for Programming Language Processing" | Code Classification, Clone Detection | C | link, also used in ASTNN |
| code2seq | Datasets released with the code2vec (POPL'19) and code2seq (ICLR'19) papers, among others | Code Completion | Java, C# | link |
| BigCloneBench | Clone detection benchmark of known clones in the IJaDataset source repository | Clone Detection | Java | link |
| Google Code Jam | Projects collected from the Google Code Jam competition | Clone Detection | Java | link |
| CodeChef | Program classification dataset released on Kaggle | Code Classification | Java | link |
| OOPSLA19Li | Dataset released with the OOPSLA'19 paper "Improving Bug Detection via Context-based Code Representation Learning and Attention-based Neural Networks" | Bug Detection | Java | link |
| Devign | Dataset released with the NeurIPS'19 paper "Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks" | Vulnerability Identification | C | link |
| Draper | Source code of 1.27 million functions mined from open-source software, labelled by static analysis for potential vulnerabilities. Released with the ICMLA'18 paper "Automated Vulnerability Detection in Source Code Using Deep Representation Learning" | Vulnerability Identification | C | link |
| VulDeePecker | Code Gadget Database (CGD), released with the NDSS'18 paper "VulDeePecker: A Deep Learning-Based System for Vulnerability Detection" | Vulnerability Detection | C/C++ | link |
| SySeVR | Semantics-based Vulnerability Candidate (SeVC) dataset, released with the arXiv'18 paper "SySeVR: A Framework for Using Deep Learning to Detect Vulnerabilities" | Vulnerability Detection | C | link |
| Seahymn | Vulnerable functions from 9 open-source software projects | Vulnerability Detection | C | link |
| Big-Vul | A C/C++ code vulnerability dataset with code changes and CVE summaries | Vulnerability Detection | C/C++ | link |
| RAISE19Ferenc | Dataset released with the RAISE'19 paper "Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions" | Vulnerability Detection | JavaScript | link |
| D2A | Differential analysis dataset released with the ICSE-SEIP'21 paper "D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis" | Vulnerability Detection | C/C++ | link |
| TypeWriter | Dataset released with the FSE'20 paper "TypeWriter: Neural Type Prediction with Search-based Validation" | Type Inference | Python | link |
| DeepTyper | Dataset released with the FSE'18 paper "Deep Learning Type Inference" | Type Inference | JavaScript | link |
| Typilus | Dataset released with the PLDI'20 paper "Typilus: Neural Type Hints" | Type Inference | Python | link |

Resources