Publication statistics

This repository establishes simple statistics for a set of conferences.

Using the DBLP data set, we extract the publications of the top conferences and aggregate them on a per-author basis. The conferences are grouped into areas (e.g., security, embedded systems, or OS), and for each area we calculate per-author statistics and present them in an overview.
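The per-author, per-area aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `AREAS` mapping, the `per_author_counts` function, and the `(venue, authors)` tuples are hypothetical stand-ins for the real conference lists and Pub objects.

```python
from collections import Counter, defaultdict

# Hypothetical area-to-venue mapping (assumption); the real groupings
# are defined in the repository.
AREAS = {
    "security": {"CCS", "USENIX Security"},
    "os": {"SOSP", "OSDI"},
}


def per_author_counts(pubs, areas=AREAS):
    """Count publications per author, separately for each area.

    `pubs` is an iterable of (venue, authors) tuples, a simplified
    stand-in for the repository's Pub objects.
    """
    counts = defaultdict(Counter)
    for venue, authors in pubs:
        for area, venues in areas.items():
            if venue in venues:
                # Each listed author gets credit for one publication.
                counts[area].update(authors)
    return counts


sample = [
    ("CCS", ["Alice", "Bob"]),
    ("SOSP", ["Alice"]),
    ("OSDI", ["Carol", "Alice"]),
    ("ICML", ["Dave"]),  # not in any tracked area, so it is ignored
]

stats = per_author_counts(sample)
for area, counter in stats.items():
    print(area, counter.most_common(3))
```

A `Counter` per area keeps the sketch short and makes ranking authors a single `most_common` call; the actual scripts may compute richer statistics.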

Processing happens in two stages, parsing and then analysis, handled by the following scripts:

  • parse_dblp.py extracts all publications and dumps them into pickle files, one per area (this is slow, as DBLP is a 3GB XML file). To process such a large XML file, we use a stream processor that dumps only the interesting publications into Pub objects (see pubs.py).
  • top_authors.py leverages the pickle files to process per-area statistics and aggregate statistics.
  • author_cliques leverages the pickle files to calculate per-area author cliques.
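The streaming approach of the parsing stage can be sketched with the standard library's `xml.etree.ElementTree.iterparse`. This is a sketch under stated assumptions rather than the actual parse_dblp.py: the `Pub` dataclass, the `INTERESTING` venue set, and `stream_pubs` are hypothetical, and the inline sample stands in for the real dump. The real DBLP file also relies on character entities defined in its DTD, which this sketch does not handle.

```python
import io
import pickle
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass, field


@dataclass
class Pub:
    """Hypothetical stand-in for the Pub class in pubs.py."""
    venue: str
    title: str
    year: int
    authors: list = field(default_factory=list)


# Hypothetical venue filter (assumption); the real lists live in the repository.
INTERESTING = {"CCS", "SOSP"}


def stream_pubs(source, venues=INTERESTING):
    """Yield Pub objects for matching records without loading the whole file."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "inproceedings":
            venue = elem.findtext("booktitle", default="")
            if venue in venues:
                yield Pub(
                    venue=venue,
                    title=elem.findtext("title", default=""),
                    year=int(elem.findtext("year", default="0")),
                    authors=[a.text for a in elem.findall("author")],
                )
            # Drop already-processed records so memory use stays flat.
            root.clear()


SAMPLE = b"""<dblp>
<inproceedings key="conf/ccs/Example20">
<author>Alice</author><author>Bob</author>
<title>An Example Paper.</title>
<year>2020</year><booktitle>CCS</booktitle>
</inproceedings>
<inproceedings key="conf/icml/Other20">
<author>Dave</author>
<title>Unrelated.</title>
<year>2020</year><booktitle>ICML</booktitle>
</inproceedings>
</dblp>"""

pubs = list(stream_pubs(io.BytesIO(SAMPLE)))

# Round-trip through pickle. Plain dicts keep the sketch independent of where
# Pub is defined; a real script would write to a per-area file instead.
blob = pickle.dumps([asdict(p) for p in pubs])
restored = pickle.loads(blob)
print(restored)
```

Clearing the root after each record is what keeps memory bounded: without it, `iterparse` still builds the full tree in the background, which defeats the purpose on a multi-gigabyte file.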

Using/Howto

  • Easy mode: check out the homepage
  • make all to download the DBLP data, pickle it, and create the HTML data
  • make fresh to update DBLP data and pickle it
  • make topauthors to create the top author pages
  • make cliques to create the cliques

Contributing

Ideas, comments, or improvements are welcome! Please reach out to Mathias Payer to discuss. You can also reach out to @gannimo on Twitter.

Changelog

  • 2023-08-21 random bugfixes and conference updates
  • 2023-02-06 adjusted SE/DB conferences based on feedback
  • 2021-02-09 fixed VLDB conference and added ICDE and PODS for the database community; added ASE and ISSTA for the software engineering community
  • 2021-01-11 added HPCA for architecture and adjusted paper length calculation for DAC
  • 2021-01-09 remove tutorials and short papers (by parsing pages data)
  • 2021-01-05 figures for overview page
  • 2021-01-04 new overview table across areas
  • 2021-01-02 added author cliques
  • 2020-12-30 first version with author statistics

Acknowledgements

This code and page were developed by Mathias Payer, initially over the 2020 holiday break. The site incorporates feedback and suggestions from too many people to list. Thank you all!

We use information from DBLP and CSRankings for anti-aliasing of authors. The idea for the statistics was inspired by Davide's Software Security Circus.

License

All data in this repository is licensed under CC BY-NC-ND 4.0.