ORCIDS from DBLP

Parses the DBLP XML dump to gather all author information per ORCID and saves it to a single CSV file. You can also download the CSV files directly from the repo: by orcid or by alias.

Run

We use a Docker-based deployment. For example, run one of the following commands:

# export by orcid info
docker run --rm -ti jfloff/dblp-orcids --out --orcids 1> by_orcid.csv
# export by alias info
docker run --rm -ti jfloff/dblp-orcids --out --alias 1> by_alias.csv

To run in a standard environment instead, install the dependencies listed in requirements.pip and run ./parse.py.

The parse.py script accepts the following options:

--out [default=True] : Writes the CSV to stdout, which is useful for redirecting the output. Redirect only stdout, since progress messages are written to stderr.
--csv [default=False] : Saves the output to a CSV file, either 'by_orcid.csv' or 'by_alias.csv', depending on which of the two options below is set.
--orcid : Groups by ORCID and lists all aliases for each ORCID.
--alias : Groups by alias and lists all ORCIDs for each alias.
--no-download [default=False] : Skips downloading the DBLP XML files. Useful for development.
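
The advice about redirecting only stdout also applies when driving the script from Python. The following is a minimal sketch and not part of this repo: export_csv is a hypothetical helper that writes a command's stdout to a file and leaves stderr alone, so progress messages still show up in the terminal.

```python
import subprocess


def export_csv(cmd, out_path):
    """Run cmd, saving only its stdout to out_path.

    Hypothetical helper for illustration. stderr is not redirected,
    so progress messages keep going to the caller's terminal.
    """
    with open(out_path, "wb") as out:
        subprocess.run(cmd, stdout=out, check=True)
```

Assuming parse.py is in the current directory, a call such as export_csv(["./parse.py", "--out", "--orcid"], "by_orcid.csv") would then produce the same file as the shell redirection.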

Note: keep in mind that the DBLP XML is large (more than 2GB). Although we tried to improve memory management while parsing, it still requires a considerable amount of memory.
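
To illustrate what memory-conscious parsing of a large XML file can look like, here is a minimal sketch using the standard library's iterparse. It is not the code in parse.py. It assumes that authors appear as author elements carrying an orcid attribute inside top-level records; the sample records are made up, and aliases_by_orcid is a hypothetical name.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def aliases_by_orcid(source):
    """Map each ORCID to the set of author names seen with it."""
    aliases = defaultdict(set)
    root = None
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first element opened is the document root
            depth += 1
            continue
        depth -= 1
        if elem.tag == "author":
            orcid = elem.get("orcid")  # assumed attribute name
            if orcid and elem.text:
                aliases[orcid].add(elem.text)
        if depth == 1:
            # A top-level record just closed: drop everything parsed so far
            # so the in-memory tree does not grow with the file.
            root.clear()
    return dict(aliases)


# Made-up sample in the assumed layout, for illustration only.
SAMPLE = b"""<dblp>
  <article key="journals/x/1">
    <author orcid="0000-0000-0000-0001">Alice Example</author>
    <author>Bob Example</author>
    <title>Paper one</title>
  </article>
  <inproceedings key="conf/x/2">
    <author orcid="0000-0000-0000-0001">A. Example</author>
    <title>Paper two</title>
  </inproceedings>
</dblp>"""

result = aliases_by_orcid(io.BytesIO(SAMPLE))
```

The real dump relies on a DTD for character entities, which the standard library parser does not load, so this sketch would likely need a DTD-aware parser before it could read the actual file.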

The image jfloff/dblp-orcids is already available through Docker Hub, but you can always clone this repo and build it yourself:

docker build --rm -t jfloff/dblp-orcids .

Read CSV

Here is a snippet to load the CSV files into pandas:

from ast import literal_eval
import pandas as pd

orcid_info = pd.read_csv('by_orcid.csv', comment='#', encoding='utf-8',
                dtype={
                    # force dtypes: pandas has trouble guessing these types
                    'acm_id': object,
                    'scopus_id': object,
                },
                converters={
                    # parse the lists of aliases and dblp_keys into Python lists
                    'alias': literal_eval,
                    'dblp_key': literal_eval,
                },
                # optional: set index to orcid
                index_col='orcid')

alias_info = pd.read_csv('by_alias.csv', comment='#', encoding='utf-8',
                dtype={
                    # force dtypes: pandas has trouble guessing these types
                    'acm_id': object,
                    'scopus_id': object,
                },
                converters={
                    # parse the list of orcids (there might be multiple)
                    'orcid': literal_eval,
                },
                # optional: set index to alias
                index_col='alias')
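
As a small follow-up, the sketch below shows what the dtype and converters settings buy you, and how to turn the per-ORCID table into an alias lookup. It runs on a made-up in-memory sample rather than the real files. The column set is assumed from the snippet above, and the real CSVs may contain more columns.

```python
import io
from ast import literal_eval

import pandas as pd

# Made-up rows in the assumed layout, for illustration only.
sample = '''# illustrative data only
orcid,alias,dblp_key,acm_id,scopus_id
0000-0000-0000-0001,"['A. Example', 'Alice Example']","['homepages/00/0001']",0012345,
0000-0000-0000-0002,"['Bob Example']","['homepages/00/0002']",,98765
'''

orcid_info = pd.read_csv(
    io.StringIO(sample),
    comment='#',
    # keep ids as strings so leading zeros survive
    dtype={'acm_id': object, 'scopus_id': object},
    # turn the stringified lists back into Python lists
    converters={'alias': literal_eval, 'dblp_key': literal_eval},
    index_col='orcid',
)

# One row per (orcid, alias) pair, then re-index by alias.
alias_to_orcid = (
    orcid_info['alias']
    .explode()
    .reset_index()
    .set_index('alias')['orcid']
)
```

With this in place, alias_to_orcid['Alice Example'] looks up the ORCID for a given name. If the same alias were listed under several ORCIDs, the lookup would return all matching rows instead of a single value.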

License

The code in this repository, unless otherwise noted, is MIT licensed. See the LICENSE file in this repository. When using this repository do not forget to acknowledge DBLP.