A curated repository of software engineering repository mining data sets

awesome-msr

A curated repository of data sets and tools that can be used for data-driven empirical software engineering, a method also known as mining software repositories (MSR). For examples of such work, see the Mining Software Repositories conference. Many of the data sets can also be useful in research using search-based software engineering.

  • This list requires your input for its continuous improvement. Read the contribution guide for instructions on how you can contribute. Alternatively, you can send me an email if you find the process too cumbersome or confusing.
  • For more awesome lists, see awesome.

Contents

Data Sets

  • AndroZoo - Collection of Android Applications
  • Bug Prediction Dataset - Collection of models and metrics from Eclipse JDT Core, PDE UI, Equinox Framework, Lucene, Mylyn, and their histories
  • Code Reviews - Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse
  • CoREBench - Collection of 70 realistically Complex Regression Errors that were systematically extracted from the repositories and bug reports of four open-source software projects: Make, Grep, Findutils, and Coreutils
  • Defects4J - Collection of 395 reproducible bugs collected with the goal of advancing software testing research
  • Enron Spreadsheets and Emails - All the spreadsheets and emails used in the paper 'Enron's Spreadsheets and Related Emails: A Dataset and Analysis'
  • Findbugs-maven - Set of FindBugs reports for the Java projects of the Maven repository
  • GHTorrent - Scalable, queryable, offline mirror of data offered through the GitHub REST API
  • GitHub on Google BigQuery - GitHub data accessible through Google's BigQuery platform
  • KaVE - Developer tool interaction data
  • Maven metrics - Collection of software complexity & sizing metrics for the Maven Repository
  • mzdata - Multi-extract and multi-level dataset of Mozilla issue tracking history
  • OCL Expressions on GitHub - Data set of 9188 OCL expressions originating from 504 EMF meta-models in 245 systematically selected GitHub repositories.
  • RepoReapers Data Set - Data set containing a collection of engineered software projects from GHTorrent.
  • SIR - Software-artifact infrastructure repository; Java, C, C++, and C# software together with test suites and fault data
  • STAMINA - (STAte Machine INference Approaches) data are used to benchmark techniques for learning deterministic finite state machines (FSMs)
  • Stack Exchange - Anonymized dump of all user-contributed content on the Stack Exchange network.
  • tera-PROMISE - Research dataset repository specializing in software engineering research datasets.
  • TravisTorrent - Provides free and easy-to-use Travis CI build analyses.
  • Unix history - Git repository covering 46 years of Unix evolution
  • UML in OSS - More than 93,000 UML files (collected from more than 24,000 GitHub repositories)
  • Zenodo - operated by CERN, contains several collections about software data:

Tools

  • ckjm - Chidamber and Kemerer Java Metrics
  • MetricMiner - Lean Java DSL to mine and extract data (e.g. commits, developers, modifications, diffs) from Git and SVN repositories
  • qmcalc - Calculate quality metrics from C source code
  • reaper - Python tool to compute a score for a repository from GHTorrent. The score quantifies the extent to which the project contained within the repository is engineered.
  • Boa - Domain-specific language and infrastructure that eases mining software repositories
  • Diggit - Agile Ruby Tool to analyze Git repositories
  • GrimoireLab - Free/Libre/Open Source tools for Software Development Analytics

License

CC0

To the extent possible under law, Diomidis Spinellis has waived all copyright and related or neighboring rights to this work.