We release the datasets we used to study real-world access-denied issues. Some of the findings are summarized in the paper:
T. Xu, H. M. Naing, L. Lu, and Y. Zhou, How Do System Administrators Resolve Access-Denied Issues in the Real World? Proceedings of the 35th Annual CHI Conference on Human Factors in Computing Systems (CHI'17), May 6-11, 2017, Denver, CO, USA. [Download]
Please cite the paper if you use the datasets :-)
The dataset contains 486 cases reported for four software projects: Apache HTTP Server (http://httpd.apache.org/), MySQL (https://www.mysql.com/), Hadoop (https://hadoop.apache.org/), and CentOS (https://www.centos.org/).
Software | #cases | period |
---|---|---|
Apache | 126 | 2001--2016 |
MySQL | 117 | 1999--2016 |
Hadoop | 101 | 2009--2016 |
CentOS | 142 | 2005--2016 |
We do not consider versions in these datasets, which means a case could come from any version of the software.
Please refer to the paper for details. In short, all the cases were collected from mailing list archives and Q&A forums of the studied software, including:
- ServerFault (http://serverfault.com/)
- StackOverflow (http://stackoverflow.com/)
- Database Administrators (http://dba.stackexchange.com/)
- CentOS Forums (https://www.centos.org/forums/)
- Apache HTTP Server Users Mailing Lists (users@httpd.apache.org)
- MYSQL General List (mysql@lists.mysql.com)
- Hadoop User Mailing Lists (user@hadoop.apache.org)
- CentOS Mailing Lists (https://lists.centos.org/pipermail/centos/)
We crawled or downloaded the entire mailing list archives and the online posts with specific tags, from the first mail or post to the most recent one at the time of collection. The collection was conducted from January to March 2016. We automatically parsed the entire mailing list archives and all the posts from the above sources through a number of filter pipelines. At the end of the pipelines, we manually collected the files from the sources.
The data files are all in CSV format (with a title row). If you want to study them closely, I suggest loading the CSV files into a spreadsheet application such as Google Sheets, which dramatically improves readability. In fact, the data were originally maintained in our own Google Sheets with colors and marks.
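If you prefer to process the data programmatically, the following is a minimal sketch using only the Python standard library. It relies on the title row to name the fields. The file name `apache.csv`, the column name `category`, and the UTF-8 encoding are assumptions for illustration only; substitute the actual file and column names from this repo.

```python
import csv
from collections import Counter


def load_cases(csv_path):
    """Read one data file into a list of dicts keyed by the title row."""
    # newline="" lets the csv module handle quoted fields containing line breaks.
    # "utf-8-sig" is an assumption: it reads plain UTF-8 and also drops a BOM
    # if a spreadsheet export added one.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))


def count_values(cases, column):
    """Count how often each value appears in the given column."""
    return Counter(row[column] for row in cases)


# Example usage (file and column names are hypothetical placeholders):
#   cases = load_cases("apache.csv")
#   print(len(cases), "cases; columns:", list(cases[0].keys()))
#   print(count_values(cases, "category").most_common(5))
```

Printing the keys of the first row is a quick way to see which columns a file actually contains before counting anything.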
If any columns or abbreviations are unclear, you can post an issue on this GitHub repo and I will answer there.
Note: the CSV data files are only a part of the spreadsheet we used in the study (we removed certain fields that we used for cross-validation, notes, and scratch work).