Bibliometrics for software engineering conferences
Contents
This is a database of papers and programme committee members for software engineering conferences. It contains more than ten years of history for each of the following conferences:
- ICSE, International Conference on Software Engineering
- ICSM, IEEE International Conference on Software Maintenance
- ASE, IEEE/ACM International Conference on Automated Software Engineering
- FSE, ACM SIGSOFT Symposium on the Foundations of Software Engineering
- FASE, International Conference on Fundamental Approaches to Software Engineering
- MSR, Working Conference on Mining Software Repositories
- WCRE, Working Conference on Reverse Engineering
- CSMR, European Conference on Software Maintenance and Reengineering
- GPCE, Generative Programming and Component Engineering
- ICPC, IEEE International Conference on Program Comprehension
- SCAM, International Working Conference on Source Code Analysis & Manipulation
The data is stored in a MySQL database (see the SQL dump) with the following schema:
Alternatively, the database can be recreated (hence easily extended) from CSV files using Python and the SQLAlchemy Object Relational Mapper using the scripts included (more details below).
Data provenance
- Papers and authors: the DBLP data dump. Papers which were part of the main (research) track have been (manually) marked as such in the
main_track
column. The conference impact factor (theimpact
column from theconferences
table) is the SHINE h-index for the period 2000-2012. - Number of submssions: Tao Xie's software engineering conference statistics; foreword to proceedings.
- Composition of programme committee: conference websites, only programme committee members for the main tracks have been included. Disambiguation was performed to align the spelling used on the different websites to that found in DBLP.
In some cases the DBLP data also contains the session title(s) for a given paper. For example, for papers published at ICSE 2012, a session title (such as Technical Research
, originally encoded as an HTML h2
header and recorded in the session_h2
column) and a session subtitle (such as Fault Handling
, originally encoded as an HTML h3
header and recorded in the session_h3
column) is available. When available, such titles could be used to automatically filter papers if so desired for a certain bibliometric analysis.
Using the database
Directly
Most simply, you can import the SQL dump into your favourite database management system (tested on MySQL) and start querying.
Via Python
Alternatively, you can take a look at how the database was created using MySQL, Python and SQLAlchemy, and use these mechanisms also for querying. This will allow you to easily extend the database or update its schema.
Dependencies and installation instructions
If you take this path, make sure you have Python and a MySQL server installed before attempting anything. Thanks a lot to Leon Moonen for spelling out the exact steps (tested on his OS X 10.8.5 machine with Python 2.7.2).
- Install Unidecode:
easy_install Unidecode
- Install SQLAlchemy:
easy_install SQLAlchemy
- Make sure that mysql bin dir is in path (or next step will fail on mysql_config)
- Make sure that mysql lib dir is in dynamic library (or next step will fail on loading the library)
- Install MySQL-Python:
easy_install mysql-python
- Tweek
populateDB.py
for your particular MySQL user and password (the script assumes user root with empty password)
Python scripts
-
initDB.py
: declares the database schema using Python classes (will be automatically mapped to tables by SQLAlchemy). -
populateDB.py
: reads data about the papers and programme committees for each conference and loads it into the database. -
metrics.py
: defines a metrics model and how to compute the metrics. To account for the different ages of the conferences, we use sliding window metrics. For example,- author turnover RNA(c,y,k): fraction of authors at conference c in year y that have not been author between y-k and y-1.
- programme committee turnover RNC(c,y,k): fraction of PC of c in year y that have not served on the PC between y-k and y-1.
- inbreeding ratio RAC(c,y,k): fraction of papers published at c in year y co-authored by PC members from y-k to y.
For a complete list of metrics check this list, or see our Science of Computer Programming article.
-
queryDB.py
: queries the database, computes the metrics defined in the metrics model, and outputs the results to CSV files. For an example of a visualisation of these results, we include thevisualisation.r
R script that produces the following plot for RAC(c,y,0), the fraction of papers each year co-authored by PC members from that year.
Licenses
- The database is made available under the Open Database License
- Any rights in individual contents of the database (i.e., the data) are licensed under the Database Contents License
- The tooling (e.g., Python scripts and R scripts) used are licensed under the GNU Lesser General Public License version 3
Citation information
If you find the dataset or tooling useful in your research, please consider citing the following paper:
Bogdan Vasilescu, Alexander Serebrenik, and Tom Mens, "Mining software engineering conference data", in MSR '13: Proceedings of the 10th Working Conference on Mining Software Repositories, May 18-–19, 2013. San Francisco, California, USA; pages 373–376, ACM.
Additionally, if you're interested in a "health assessment" of software engineering conferences, consider reading our Science of Computer Programming paper:
Bogdan Vasilescu, Alexander Serebrenik, Tom Mens, Mark van den Brand, and Ekaterina Pek, "How healthy are software engineering conferences?", Science of Computer Programming 89, Part C, (2014), 251–272.