awesome-msr

A curated repository of data sets and tools that can be used for data-driven empirical software engineering, a method also known as mining software repositories (MSR). For examples of such work see the Mining Software Repositories conference. Many of the data sets can also be useful in research using search-based software engineering.

This list requires your input for its continuous improvement. Read the contribution guide for instructions on how you can contribute. Alternatively, you can send me an email if you find the process too cumbersome or confusing.
For more awesome lists, see awesome.

Data Sets
Tools

Data Sets

AndroZoo - Collection of Android Applications
Bug Prediction Dataset - Collection of models and metrics from Eclipse JDT Core, PDE UI, Equinox Framework, Lucene, Mylyn, and their histories
Code Reviews - Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse
CoREBench - Collection of 70 realistically Complex Regression Errors that were systematically extracted from the repositories and bug reports of four open-source software projects: Make, Grep, Findutils, and Coreutils
Defects4J - Collection of 395 reproducible bugs collected with the goal of advancing software testing research
Enron Spreadsheets and Emails - All the spreadsheets and emails used in the paper 'Enron's Spreadsheets and Related Emails: A Dataset and Analysis'
Findbugs-maven - Set of FindBugs reports for the Java projects of the Maven repository
GHTorrent - Scalable, queryable, offline mirror of data offered through the GitHub REST API
GitHub on Google BigQuery - GitHub data accessible through Google's BigQuery platform
KaVE - Developer tool interaction data
Maven metrics - Collection of software complexity & sizing metrics for the Maven Repository
mzdata - Multi-extract and multi-level dataset of Mozilla issue tracking history
OCL Expressions on GitHub - Data set of 9188 OCL expressions originating from 504 EMF meta-models in 245 systematically selected GitHub repositories.
RepoReapers Data Set - Data set containing a collection of engineered software projects from GHTorrent.
SIR - Software-artifact infrastructure repository; Java, C, C++, and C# software together with test suites and fault data
STAMINA - (STAte Machine INference Approaches) Data used to benchmark techniques for learning deterministic finite state machines (FSMs)
Stack Exchange - Anonymized dump of all user-contributed content on the Stack Exchange network.
tera-PROMISE - Research dataset repository specializing in software engineering research datasets.
TravisTorrent - Provides free and easy-to-use Travis CI build analyses.
Unix history - Git repository covering 46 years of Unix evolution
UML in OSS - More than 93,000 UML files (collected from more than 24,000 GitHub repositories)
Zenodo - operated by CERN, contains several collections about software data:
- Software Engineering Artifacts Can Really Assist Future Tasks
- Empirical Software Engineering
- Mining Software Repositories

Tools

ckjm - Chidamber and Kemerer Java Metrics
MetricMiner - Lean Java DSL to mine and extract data (e.g. commits, developers, modifications, diffs) from Git and SVN repositories
qmcalc - Calculate quality metrics from C source code
reaper - Python tool to compute a score for a repository from GHTorrent. The score quantifies the extent to which the project contained within the repository is engineered.
Boa - Domain-specific language and infrastructure that eases mining software repositories
Diggit - Agile Ruby Tool to analyze Git repositories
GrimoireLab - Free/Libre/Open Source tools for Software Development Analytics

License

To the extent possible under law, Diomidis Spinellis has waived all copyright and related or neighboring rights to this work.
