
A curated repository of software engineering repository mining data sets

Creative Commons Zero v1.0 Universal (CC0-1.0)

awesome-msr

A curated repository of data sets and tools that can be used for data-driven empirical software engineering, a method also known as mining software repositories (MSR). For examples of such work see the Mining Software Repositories conference.

This list is under construction and requires your input. Please contribute additions through a GitHub pull request. (Or send me an email if you find that too cumbersome.) The additions you specify should be genuinely useful to MSR researchers; the objective of this list is utility rather than comprehensiveness.

For more awesome lists, see awesome.

Data Sets

  • AndroZoo - a growing collection of Android Applications
  • Boa - a domain-specific language and infrastructure that eases mining software repositories
  • Code Reviews - Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse
  • Enron Spreadsheets and Emails - all the spreadsheets and emails used in the paper 'Enron's Spreadsheets and Related Emails: A Dataset and Analysis'
  • Findbugs-maven - a set of FindBugs reports for the Java projects of the Maven repository
  • GHTorrent - an effort to create a scalable, queryable, offline mirror of data offered through the GitHub REST API
  • Maven metrics - a collection of software complexity & sizing metrics for the Maven Repository
  • mzdata - Multi-extract and Multi-level Dataset of Mozilla Issue Tracking History
  • RepoReapers Data Set - A data set containing a collection of engineered software projects from GHTorrent.
  • Stack Exchange - an anonymized dump of all user-contributed content on the Stack Exchange network.
  • tera-PROMISE - a research dataset repository specializing in software engineering research datasets. It has since moved to Zenodo and is now called SeaCraft.
  • TravisTorrent - free and easy-to-use Travis CI build analyses.
  • Unix history - a Git repository with 46 years of Unix history evolution

Tools

  • ckjm - Chidamber and Kemerer Java Metrics
  • MetricMiner - a lean Java DSL to mine and extract data (e.g. commits, developers, modifications, diffs) from Git and SVN repositories
  • qmcalc - calculate quality metrics from C source code
  • reaper - A Python tool to compute a score for a repository from GHTorrent. The score quantifies the extent to which the project contained within the repository is engineered.



License

To the extent possible under law, Diomidis Spinellis has waived all copyright and related or neighboring rights to this work.