
A collection of publicly available bug reports


BugRepo maintains a collection of bug reports that are publicly available for research purposes. Bug reports are a primary data source for NLP-based research in software engineering. We group the datasets by the following research directions.

1. Duplicate bug identification

Project          | Timespan                | #Components | #Issues | #Issues/day | #Duplicates | %Duplicates | Median Resolving Time
Mozilla Core     | 1997/03/28 ~ 2013/12/31 | 130         | 205,069 | 33.5        | 44,691      | 21.8%       | 102.1 days
Firefox          | 1999/07/30 ~ 2013/12/31 | 52          | 115,814 | 22.0        | 35,814      | 30.9%       | 76.4 days
Thunderbird      | 2000/04/12 ~ 2013/12/31 | 23          | 32,551  | 6.5         | 12,501      | 38.4%       | 83.7 days
Eclipse Platform | 2001/10/10 ~ 2013/12/30 | 21          | 85,156  | 19.1        | 14,404      | 16.9%       | 29.8 days
JDT              | 2001/10/10 ~ 2013/12/31 | 6           | 45,296  | 10.1        | 7,688       | 17.0%       | 23.0 days
Spark            | 2010/04/01 ~ 2018/01/10 | 29          | 22,639  | 8.0         | 3,077       | 13.6%       | 7.1 days
Hadoop           | 2005/07/24 ~ 2017/11/01 | 45          | 12,855  | 2.9         | 1,861       | 14.5%       | 14.3 days
MapReduce        | 2006/03/17 ~ 2018/01/15 | 63          | 7,019   | 1.6         | 977         | 13.9%       | 28.2 days
HDFS             | 2006/04/06 ~ 2018/01/12 | 71          | 12,779  | 3.0         | 1,659       | 13.0%       | 9.7 days
HBase            | 2007/02/27 ~ 2018/01/21 | 95          | 19,788  | 5.0         | 1,340       | 6.8%        | 6.8 days
Cassandra        | 2009/03/07 ~ 2018/01/21 | 24          | 14,071  | 4.3         | 2,083       | 14.8%       | 8.6 days
Mesos            | 2011/02/16 ~ 2018/01/26 | 40          | 8,454   | 3.3         | 800         | 9.5%        | 23.5 days
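
The per-day and percentage columns in the table above can be recomputed from the raw counts. The sketch below assumes that the issue rate is the number of issues divided by the number of calendar days in the timespan, and that the duplicate share is duplicates divided by issues; the source does not state these formulas, but they reproduce the Mozilla Core row. The function names are illustrative.

```python
from datetime import date


def issues_per_day(n_issues: int, start: date, end: date) -> float:
    """Average number of issues filed per calendar day between start and end."""
    days = (end - start).days
    return n_issues / days


def duplicate_ratio(n_duplicates: int, n_issues: int) -> float:
    """Share of issues marked as duplicates, in percent."""
    return 100.0 * n_duplicates / n_issues


# Mozilla Core row: 205,069 issues, 44,691 duplicates, 1997/03/28 ~ 2013/12/31.
rate = issues_per_day(205_069, date(1997, 3, 28), date(2013, 12, 31))
share = duplicate_ratio(44_691, 205_069)
print(f"{rate:.1f} issues/day, {share:.1f}% duplicates")
# prints "33.5 issues/day, 21.8% duplicates"
```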

Train/test data splitting: We sort each dataset in chronological order and use the earliest 80% of the reports as training data and the remaining 20% as test data.

Project          | Total (+/-)              | Train (+/-)              | Test (+/-)
Mozilla Core     | 205,069 (54,237/150,832) | 164,055 (50,122/113,933) | 41,014 (4,115/36,899)
Firefox          | 115,814 (34,262/81,552)  | 92,651 (30,026/62,625)   | 23,163 (4,236/18,927)
Thunderbird      | 32,551 (11,631/20,920)   | 26,040 (10,046/15,994)   | 6,511 (1,585/4,926)
Eclipse Platform | 85,156 (19,845/65,311)   | 68,124 (17,518/50,606)   | 17,032 (2,327/14,705)
JDT              | 45,296 (10,127/35,169)   | 36,236 (8,859/27,377)    | 9,060 (1,268/7,792)
Spark            | 19,766 (2,813/16,953)    | 15,812 (2,425/13,387)    | 3,954 (388/3,566)
Hadoop           | 10,624 (827/9,797)       | 8,499 (656/7,843)        | 2,125 (171/1,954)
MapReduce        | 5,608 (880/4,728)        | 4,486 (779/3,707)        | 1,122 (101/1,021)
HDFS             | 10,676 (1,530/9,146)     | 8,540 (1,398/7,142)      | 2,136 (132/2,004)
HBase            | 16,594 (455/16,139)      | 13,275 (384/12,891)      | 3,319 (71/3,248)
Cassandra        | 11,950 (1,261/10,689)    | 9,560 (962/8,598)        | 2,390 (299/2,091)
Mesos            | 6,564 (615/5,949)        | 5,251 (535/4,716)        | 1,313 (80/1,233)
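
A chronological 80/20 split like the one described above can be sketched as follows. This is a minimal illustration, not the script used to produce the table: the record layout and the field names `created` and `is_duplicate` are assumptions. Rounding the cut point down matches the train sizes listed, for example int(205,069 × 0.8) = 164,055 for Mozilla Core.

```python
from datetime import date


def chronological_split(reports, train_fraction=0.8):
    """Sort reports by creation date and cut once: oldest part trains, newest part tests."""
    ordered = sorted(reports, key=lambda r: r["created"])
    cut = int(len(ordered) * train_fraction)  # rounds down
    return ordered[:cut], ordered[cut:]


# Hypothetical toy records, deliberately listed newest-first.
reports = [
    {"id": i, "created": date(2010, 1, 1 + i), "is_duplicate": i % 3 == 0}
    for i in reversed(range(10))
]
train, test = chronological_split(reports)
print(len(train), len(test))
```

Splitting by time rather than at random keeps every test report newer than every training report, so a model is never evaluated on reports that predate its training data.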

Links to more duplicate bug report datasets


2. Bug localization

Bug localization is the process of mapping a bug report to the source files that contain the bug. This dataset contains bug reports, commit history, and API descriptions for six open-source Java projects: Eclipse Platform UI, SWT, JDT, AspectJ, Birt, and Tomcat. The dataset is currently available here.

Project  | Timespan                | #Bugs mapped
AspectJ  | 2002-03-13 ~ 2014-01-10 | 593
Birt     | 2005-06-14 ~ 2013-12-19 | 4,178
Eclipse  | 2001-10-10 ~ 2014-01-17 | 6,495
JDT      | 2001-10-10 ~ 2014-01-14 | 6,274
SWT      | 2002-02-19 ~ 2014-01-17 | 4,151
Tomcat   | 2002-07-06 ~ 2014-01-18 | 1,056
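
To make the task concrete, the sketch below ranks source files by how closely their text matches a bug report. It uses plain term-frequency cosine similarity, a simple baseline chosen here for illustration and not the method of the dataset authors. The file paths and contents are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_files(bug_report: str, files: dict) -> list:
    """Return file paths sorted from most to least similar to the report."""
    query = Counter(tokenize(bug_report))
    scores = {path: cosine(query, Counter(tokenize(src))) for path, src in files.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical source files, not taken from the dataset.
files = {
    "io/FileReader.java": "class FileReader { String read(String path) { return null; } }",
    "ui/Button.java": "class Button { void paint() { drawBorder(); drawLabel(); } }",
}
ranking = rank_files("Button label is not painted after resize; paint skips drawLabel", files)
print(ranking[0])
```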


3. Bug triaging

Given a software bug report, bug triaging is the process of identifying an appropriate developer to fix the bug. Automatic bug triaging can be formulated as a classification problem that takes the bug title and description as input and maps them to one of the available developers (class labels). The dataset is currently available here.

Project      | #Bugs   | #Bugs for classifier
Chromium     | 383,104 | 118,643
Mozilla Core | 314,388 | 128,215
Firefox      | 162,307 | 24,214
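
The classification view of triaging can be illustrated with a small text classifier. The sketch below is a multinomial naive Bayes model over bag-of-words features with add-one smoothing, written in plain Python. The model choice, the class name `NaiveBayesTriager`, and the training examples are all assumptions made for illustration; the source does not prescribe a particular classifier.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesTriager:
    """Multinomial naive Bayes over bag-of-words features with add-one smoothing."""

    def fit(self, texts, developers):
        self.word_counts = defaultdict(Counter)  # developer -> token counts
        self.doc_counts = Counter(developers)    # developer -> number of bugs fixed
        for text, dev in zip(texts, developers):
            self.word_counts[dev].update(tokenize(text))
        self.vocab = {t for counts in self.word_counts.values() for t in counts}
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_dev, best_score = None, -math.inf
        for dev, n_docs in self.doc_counts.items():
            counts = self.word_counts[dev]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior plus smoothed log likelihood of each token
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_dev, best_score = dev, score
        return best_dev


# Hypothetical training data: bug title plus description, and the fixing developer.
texts = [
    "Crash in renderer when resizing window",
    "Renderer paints black frame after resize",
    "Bookmark sync fails with network timeout",
    "Sync server returns timeout for history upload",
]
developers = ["alice", "alice", "bob", "bob"]
triager = NaiveBayesTriager().fit(texts, developers)
prediction = triager.predict("Window resize causes renderer crash")
print(prediction)
```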


4. Bug-fixing time estimation

The bug report datasets hosted in this repository include detailed tracking information on bug-fixing time and can therefore be used for research on bug-fixing time estimation.
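
As a starting point for such work, fixing time can be derived as the difference between a report's creation and resolution timestamps. The sketch below is illustrative only: the field names `created` and `resolved` and the timestamp format are assumptions, and the actual datasets may store this information differently.

```python
from datetime import datetime
from statistics import median


def fixing_time_days(created: str, resolved: str) -> float:
    """Elapsed time between report creation and resolution, in days."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(resolved, fmt) - datetime.strptime(created, fmt)
    return delta.total_seconds() / 86_400


# Hypothetical records; field names and timestamp format are assumptions.
reports = [
    {"created": "2017-01-01 00:00:00", "resolved": "2017-01-03 12:00:00"},
    {"created": "2017-02-10 08:00:00", "resolved": "2017-02-11 08:00:00"},
    {"created": "2017-03-05 09:30:00", "resolved": "2017-03-15 09:30:00"},
]
durations = [fixing_time_days(r["created"], r["resolved"]) for r in reports]
median_days = median(durations)
print(f"median fixing time: {median_days} days")
```

The median is used here rather than the mean because a few long-lived reports can otherwise dominate the summary.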


5. Bug information mining

Lamkanfi et al. [MSR'13] contributed a dataset of over 200,000 reported bugs extracted from the Eclipse and Mozilla projects. In addition to a single snapshot of each bug report, the dataset includes all incremental modifications made during the report's lifetime. The dataset is currently available here.

Project          | #Components | #Bugs
Eclipse Platform | 22          | 24,775
JDT              | 6           | 10,814
CDT              | 20          | 5,640
GEF              | 5           | 5,655
Mozilla Core     | 137         | 74,292
Firefox          | 47          | 69,879
Thunderbird      | 23          | 19,237
Bugzilla         | 21          | 4,616
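
Because the incremental modifications are recorded, the state of a report at any point in its lifetime can be rebuilt by replaying the updates up to that time. The sketch below shows the idea under an assumed record layout: the keys `when`, `field`, and `new_value` are hypothetical and do not reflect the dataset's actual schema.

```python
def snapshot_at(initial: dict, updates: list, when: str) -> dict:
    """Rebuild a bug report's state at time `when` by replaying earlier updates."""
    state = dict(initial)
    for update in sorted(updates, key=lambda u: u["when"]):
        if update["when"] > when:
            break  # all remaining updates happened later
        state[update["field"]] = update["new_value"]
    return state


# Hypothetical history; ISO 8601 timestamps compare correctly as strings.
initial = {"status": "NEW", "priority": "P3", "assignee": None}
updates = [
    {"when": "2010-05-02T10:00:00", "field": "assignee", "new_value": "dev@example.org"},
    {"when": "2010-05-04T16:30:00", "field": "status", "new_value": "ASSIGNED"},
    {"when": "2010-06-01T09:00:00", "field": "status", "new_value": "RESOLVED"},
]
mid_may = snapshot_at(initial, updates, "2010-05-15T00:00:00")
final = snapshot_at(initial, updates, "2010-12-31T00:00:00")
print(mid_may["status"], final["status"])
```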


  • [MSR'13] Ahmed Lamkanfi, Javier Perez, and Serge Demeyer. The Eclipse and Mozilla Defect Tracking Dataset: A Genuine Dataset for Mining Bug Information. International Working Conference on Mining Software Repositories (MSR), 2013.


The datasets are freely available for research purposes.

LogPAI Team, 2018.