Pre-print:

Primary language: Jupyter Notebook. License: MIT.

cwts_covid

This repository contains the code CWTS uses to create internal databases to study scientific literature on COVID-19. This code is provided as is for anyone who would like to replicate or expand upon it.

The code in this repository allows you to do the following steps:

  • Take published lists of scientific publications on COVID-19 and create a relational database with them.
  • Query the Dimensions and Altmetrics APIs to get more data on these publications (you will need to use your own API keys for this).
  • Do some basic plotting of this data.

This workflow can be illustrated as follows:

[Figure: Workflow]

Data sources

For the moment, we consider publications from the following sources:

  • CORD19;
  • Dimensions;
  • WHO. This data source has been dropped as of July 2020 (it is already included in CORD19).

You will need to download these datasets and place them in a local folder in order to process them. We assume that you have a local copy of the whole CORD19 dataset and a CSV file with publication metadata for Dimensions. Previous releases of the Dimensions list can be found in the datasets_input folder. Please also see the notebooks below for more details.
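As a rough illustration of the input step, the sketch below reads a publication metadata CSV file into plain Python dictionaries using only the standard library. The function name read_publication_csv and the file paths in the comments are hypothetical and are not taken from the notebooks, which may well load the data differently.

```python
import csv
from pathlib import Path


def _clean(value):
    """Strip whitespace from a cell; empty or missing cells become None."""
    if isinstance(value, str):
        value = value.strip()
        return value or None
    return None


def read_publication_csv(path):
    """Read a publication metadata CSV into a list of dicts (one per row).

    Hypothetical helper: column names are taken from the file's header row,
    so it works for any CSV export regardless of which columns it contains.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return [
            {key: _clean(value) for key, value in row.items()}
            for row in reader
        ]


# Example usage (paths are assumptions; adjust them to your local folder):
# cord19_rows = read_publication_csv("datasets_input/CORD19/metadata.csv")
# dimensions_rows = read_publication_csv("datasets_input/dimensions.csv")
```

Turning empty cells into None makes it easier to detect publications with missing identifiers later on.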

In the future, we might expand to more sources.

Steps

Create database

The relational schema we use to consolidate the data sources mentioned above is available as a SQL script (working at least on MySQL).

[Figure: SQL schema]

You can use the Notebook_1_SQL_database notebook to populate this database. This notebook allows you to insert data into a MySQL instance of your choice, where an empty database is assumed to exist with the above-mentioned schema. Alternatively, it allows you to export the relational data to Pandas tables.

An explanation of tables and identifiers

  • The pub table contains publications from all data sources. If you would like to work with publications coming exclusively from one data source, join it with the datasource table via the pub_datasource table.
  • The primary keys of all tables (pub_id, covid19_mtadata_id, dimensions_metadata_id, datasource_id) are not stable and are only internally consistent: if you create different versions of the database, they will likely differ.
  • In order to work with Dimensions and Altmetrics data, publication identifiers should be used. Please give preference to DOIs, then to PMIDs, then to PMCIDs, then to arXiv IDs, then to Dimensions IDs.
  • We removed publications which had no known identifier among these five options. Most of these, at the moment, only have Semantic Scholar IDs. We might integrate those in a future update.
  • The metadata tables contain fields that are specific to a data source and that we considered potentially useful. They are only available for publications coming from that data source.
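The points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for MySQL, the table names (pub, datasource, pub_datasource) and keys (pub_id, datasource_id) follow the description above, and every other column or field name (source, doi, pmid, pmcid, arxiv_id, dimensions_id) is hypothetical and may not match the actual schema.

```python
import sqlite3

# Preference order described above: DOI, PMID, PMCID, arXiv ID, Dimensions ID.
# The field names themselves are hypothetical.
ID_PREFERENCE = ("doi", "pmid", "pmcid", "arxiv_id", "dimensions_id")


def preferred_identifier(record):
    """Return (kind, value) for the most preferred identifier, or None."""
    for kind in ID_PREFERENCE:
        value = record.get(kind)
        if value:
            return kind, value
    return None  # no known identifier: such publications were removed


def publications_from(connection, source_name):
    """Return the pub_id values linked to one data source via pub_datasource."""
    query = """
        SELECT p.pub_id
        FROM pub AS p
        JOIN pub_datasource AS pd ON pd.pub_id = p.pub_id
        JOIN datasource AS d ON d.datasource_id = pd.datasource_id
        WHERE d.source = ?
        ORDER BY p.pub_id
    """
    return [row[0] for row in connection.execute(query, (source_name,))]


def build_demo_database():
    """Create a tiny in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pub (pub_id INTEGER PRIMARY KEY, doi TEXT);
        CREATE TABLE datasource (datasource_id INTEGER PRIMARY KEY, source TEXT);
        CREATE TABLE pub_datasource (pub_id INTEGER, datasource_id INTEGER);
        INSERT INTO pub VALUES (1, '10.1000/a'), (2, '10.1000/b'), (3, NULL);
        INSERT INTO datasource VALUES (1, 'CORD19'), (2, 'Dimensions');
        INSERT INTO pub_datasource VALUES (1, 1), (2, 1), (2, 2), (3, 2);
    """)
    return conn
```

Because the primary keys are not stable across database versions, anything that needs to be matched against Dimensions or Altmetrics should rely on the identifier returned by a helper like preferred_identifier rather than on pub_id.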

Query Dimensions and Altmetrics

You can then query the Dimensions and Altmetrics APIs with your own keys using the Notebook_2_API_queries notebook. You can request access as a researcher here: https://www.dimensions.ai/scientometric-research.
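To give an idea of what such queries can look like, here is a minimal sketch that only builds request URLs and query strings and performs no network calls. It is not the code used in Notebook_2_API_queries. The Altmetric endpoint and the Dimensions DSL syntax shown here reflect our reading of the public documentation and should be treated as assumptions to verify against the official API references; all function names are hypothetical.

```python
import json
from urllib.parse import quote, urlencode

# Assumed base URL of the Altmetric endpoint for per-DOI lookups.
ALTMETRIC_BASE = "https://api.altmetric.com/v1/doi/"


def altmetric_doi_url(doi, api_key=None):
    """Build the lookup URL for one DOI, optionally appending an API key."""
    url = ALTMETRIC_BASE + quote(doi, safe="/")
    if api_key:
        url += "?" + urlencode({"key": api_key})
    return url


def chunked(items, size):
    """Yield consecutive slices of at most `size` items, for batched queries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def dimensions_doi_query(dois, fields=("doi", "times_cited")):
    """Build a Dimensions DSL query string for a batch of DOIs.

    The DSL syntax and the field names are assumptions; check them against
    the Dimensions documentation before use.
    """
    dois = list(dois)
    doi_list = json.dumps(dois)  # renders as ["10.1000/a", "10.1000/b"]
    returned = "+".join(fields)
    return (
        f"search publications where doi in {doi_list} "
        f"return publications[{returned}] limit {len(dois)}"
    )
```

Batching identifiers with chunked keeps each request small, which helps to stay within whatever rate and size limits apply to your API keys.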

Data analysis

Using the Notebook_3_metadata_overview and Notebook_4_API_data_overview notebooks, you can get an overview of some of the resulting metadata and data.

Replication of paper findings

Finally, there are three notebooks to help replicate at least part of the analysis in the accompanying paper (Colavizza et al., 2020; see How to cite).

The two citation network clustering solutions discussed in the paper, using both CORD-19 and external references, are also provided as a separate file. These results were generated using cluster.py, which may require the development version of python-igraph until the upcoming release (0.8.1) is out. For this reason, we also include the clustering results themselves.

Some steps in the analyses are not included here since they require proprietary data. They can be replicated by getting access to the data (see above) and following the steps detailed in the paper.

How to give feedback

Please open an issue, or propose changes using a Pull Request.

How to cite

@article {Colavizza2020.04.20.046144,
    author = {Colavizza, Giovanni and Costas, Rodrigo and Traag, Vincent A. and van Eck, Nees Jan and van Leeuwen, Thed and Waltman, Ludo},
    title = {A scientometric overview of CORD-19},
    elocation-id = {2020.04.20.046144},
    year = {2020},
    doi = {10.1101/2020.04.20.046144},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2020/04/20/2020.04.20.046144},
    eprint = {https://www.biorxiv.org/content/early/2020/04/20/2020.04.20.046144.full.pdf},
    journal = {bioRxiv}
}

Acknowledgements

We would like to thank Digital Science (Dimensions, Altmetrics) for their support and for making all their data available to us.