A curated list of software engineering research, data sets, and tools. Inspired by the awesome-msr project.
Papers with code implementations that can be used as baselines.
- AnswerBot, Replication package of the paper "AnswerBot: An Answer Summary Generation Tool Based on Stack Overflow", ESEC/FSE 2019, Paper
- BIKER, the dataset and source code for the paper "API Method Recommendation without Worrying About the Task-API Knowledge Gap", ASE, Paper, including about 400 API retrieval tasks from Stack Overflow.
- debug-method-name-2019-ICSE, a tool for spotting and refactoring inconsistent method names, learned from real-world code bases. Presented at ICSE 2019. The repository includes the data.
- DeepName-2021-ICSE, the code and dataset for the paper "A Context-based Automated Approach for Method Name Consistency Checking and Suggestion".
- DocSmell Benchmark, a benchmark dataset of 1000 documentation examples with five types of documentation smells.
- Chatbot4QR, from the paper "Chatbot4QR: Interactive Query Refinement for Technical Question Retrieval". Paper.
Data sets and benchmarks for different tasks.
- Code-LMs, includes a 249GB multilingual code corpus used to train language models, plus several pretrained language models for code, e.g., GPT-2 and PolyCoder. Paper.
- Toga, the replication artifact for the paper "TOGA: A Neural Method for Test Oracle Generation", ICSE 2022.
- Train Ticket: A Benchmark Microservice System. A train ticket booking system based on a microservice architecture, containing 41 microservices. The project is maintained by the CodeWisdom team of Fudan University.
- facebook Neural-Code-Search-Evaluation-Dataset, a code search dataset from Facebook: H. Li, S. Kim, and S. Chandra, “Neural code search evaluation dataset,” ArXiv, vol. abs/1908.09804, 2019.
- codesearchnet, a code search dataset from GitHub and Microsoft: H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, “Codesearchnet challenge: Evaluating the state of semantic code search,” ArXiv, vol. abs/1909.09436, 2019, github/codesearchnet.
- CosBench, 52 queries and corresponding answers from 4,199,769 Java snippets.
- FOCUS, a context-aware collaborative-filtering system that exploits cross relationships among OSS projects to suggest the inclusion of additional API invocations and concrete API usage patterns. Paper: "FOCUS: A Recommender System for Mining API Function Calls and Usage Patterns". Authors: Phuong T. Nguyen, Juri Di Rocco, Davide Di Ruscio, Lina Ochoa, Thomas Degueule, and Massimiliano Di Penta.
- CodeXGLUE - A benchmark dataset and open challenge for code intelligence. It includes 14 datasets for 10 diversified code intelligence tasks covering the following scenarios: 1) code-code (clone detection, defect detection, cloze test, code completion, code repair, and code-to-code translation); 2) text-code (natural language code search, text-to-code generation); 3) code-text (code summarization); 4) text-text (documentation translation).
- Project CodeNet, The goal of Project CodeNet is to provide the AI-for-Code research community with a large scale, diverse, and high quality curated dataset to drive innovation in AI techniques. Project CodeNet is a large scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems. Project CodeNet aims to do for AI for Code what ImageNet did for computer vision.
- PyCodeGPT, from the paper "CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation". Provides two benchmarks: PandasEval and NumpyEval.
- EMSE Bug Location Data Set, from the paper "Using Bug Descriptions to Reformulate Queries during Text-Retrieval-based Bug Localization" by Oscar Chaparro, Juan Manuel Florez, Andrian Marcus.
- DeepLocalize, from the paper "DeepLocalize: Fault Localization for Deep Neural Networks". The repository includes a benchmark.
- TSSB-3M, a mining tool and large-scale datasets of single-statement bug fixes in Python.
  - Paper: "TSSB-3M: Mining single statement bugs at massive scale".
  - TSSB-3M: A dataset of over 3 million isolated single-statement bug fixes. Each bug fix is related to a commit in a public Python project that does not change more than a single statement.
  - SSB-9M: A dataset of over 9 million single-statement bug fixes. Each fix modifies at least a single statement to fix a bug. However, the related code changes might incorporate changes to other files.
  - SSC-28M: A dataset of over 28 million general single-statement changes. The dataset is released to facilitate research in software evolution, so a code change does not necessarily relate to a bug fix.
- MUBench (pronounced "Moo Bench"), an automated benchmark for API-misuse detectors, based on the MUBench benchmarking dataset. For problems or questions, contact Sven Amann.
- CryptoAPI-Bench: A Comprehensive Benchmark on Java Cryptographic API Misuses
- great, ICLR20-Great, the dataset for the variable-misuse task, described in the ICLR 2020 paper "Global Relational Models of Source Code" [https://openreview.net/forum?id=B1lnbRNtwr]. The repository contains the data and code to replicate the paper, which covers models of source code that combine global and structural information, including the Graph-Sandwich model family and the GREAT (Graph-Relational Embedding Attention Transformer) model.
- PLUR, PLUR (Programming-Language Understanding and Repair) is a collection of source code datasets suitable for graph-based machine learning. We provide scripts for downloading, processing, and loading the datasets. This is done by offering a unified API and data structures for all datasets.
- BIKER, the dataset and source code for the paper "API Method Recommendation without Worrying About the Task-API Knowledge Gap", ASE, Paper, including about 400 API retrieval tasks from Stack Overflow.
- Deny Benchmark on Feature Location, from the JSEP 2013 paper "Feature Location in Source Code: A Taxonomy and Survey". It provides feature location data sets for ArgoUML, Eclipse, JabRef, jEdit, and muCommander.
- Java-Annotation-Study, This repository contains our code for studying Java annotation and its evolution, the collected large scale data about evolution of annotations in three years for each project, and our manual analysis of the characteristics of annotation evolution.
- nbfbaselines, Neural baselines for finding and fixing single token bugs in Python, paper, "Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes".
- Code Diff Datasets, A collection of diff datasets. It contains: Defects4J, BugsInPy, and unparsable.
- Stack Exchange - Anonymized dump of all user-contributed content on the Stack Exchange network.
- AwesomeList, awesome lists data in JSON format.
Tools that could be used in SE research
- PyDriller - Python Framework to analyse Git repositories.
- GrimoireLab - Toolset for software development analytics. By far the best set of tools to mine software repositories.
- code_diff, Fast AST based code differencing in Python
- code_tokenize, Fast tokenization and structural analysis of any programming language
- msrWS - Tutorials on mining GitHub repositories, including some PDFs.
- awesome-machine-learning-on-source-code, Cool links & research papers related to Machine Learning applied to source code (MLonCode)
- CUHK-ARISE/ml4code-dataset, a collection of datasets for machine learning for big code.
- SE4I
- huggingface dataset
- Papers using Stack Overflow data
- Open Research Datasets in Software Engineering
- This list requires your input for its continuous improvement. Read the contribution guide for instructions on how you can contribute. Alternatively, you can send me an email if you find the process too cumbersome or confusing.
- For more awesome lists, see awesome.