Parses the DBLP XML to gather all author information per ORCID and saves it to a single CSV file. You can also download the CSV files directly from the repo: by ORCID or by alias.
We use a Docker-based deployment. For example, to export each CSV, run:
# export by orcid info
docker run --rm -ti jfloff/dblp-orcids --out --orcids 1> by_orcid.csv
# export by alias info
docker run --rm -ti jfloff/dblp-orcids --out --alias 1> by_alias.csv
If you want to run in a standard environment, install the dependencies listed in requirements.pip and run ./parse.py
The parse.py script has a few options:
--out [default=True] : Outputs the CSV to stdout. Useful for redirecting output. Redirect only stdout, since stderr carries the progress messages.
--csv [default=False] : Saves the output to a CSV file, either 'by_orcid.csv' or 'by_alias.csv', depending on which of the two options below is set.
--orcid : Gathers by ORCID and lists all aliases for each ORCID.
--alias : Gathers by alias and lists all ORCIDs for each alias.
--no-download [default=False] : Does not download the DBLP XML files. Useful for development.
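The stdout/stderr split described for --out can also be reproduced from a Python script. The following is a minimal sketch, not part of this repository: export_csv is a hypothetical helper, and the demo command is a stand-in that only mimics the described behaviour (CSV on stdout, progress on stderr). For the real tool, the command would be something like ["./parse.py", "--out", "--alias"].

```python
import os
import subprocess
import sys
import tempfile


def export_csv(command, csv_path):
    """Run `command`, writing only its stdout to `csv_path`.

    stderr is deliberately left untouched, so progress messages
    still reach the terminal instead of ending up in the CSV.
    """
    with open(csv_path, "wb") as out:
        completed = subprocess.run(command, stdout=out, check=True)
    return completed.returncode


# Stand-in for the parser: one CSV line on stdout, one progress line on stderr.
demo_command = [
    sys.executable,
    "-c",
    "import sys; print('orcid,alias'); print('parsing...', file=sys.stderr)",
]

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "by_alias.csv")
    export_csv(demo_command, target)
    with open(target, encoding="utf-8") as handle:
        captured = handle.read()
```

Because check=True is set, a non-zero exit status raises subprocess.CalledProcessError rather than silently leaving a truncated CSV behind.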
Note: when running, keep in mind that the DBLP XML is large (more than 2GB). Even though we tried to improve memory management while parsing, it still requires a considerable amount of memory.
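One common way to keep memory bounded on large XML files is incremental parsing, where each record is discarded as soon as it has been read. The sketch below illustrates the idea with the standard library; it is not the code in parse.py. It assumes author elements that carry an orcid attribute, and it runs on a tiny hand-written sample with made-up ORCIDs. The real DBLP dump also relies on character entities declared in its DTD, which xml.etree does not resolve on its own, so a real parser would likely need extra handling for that.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def aliases_by_orcid(source):
    """Stream an XML file and map each ORCID to the set of names seen with it."""
    aliases = defaultdict(set)
    root = None
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first element opened is the document root
            depth += 1
            continue
        depth -= 1
        orcid = elem.get("orcid")
        if elem.tag == "author" and orcid and elem.text:
            aliases[orcid].add(elem.text.strip())
        if depth == 1:
            # A direct child of the root (one publication record) just ended:
            # detach it so already-processed records do not pile up in memory.
            root.clear()
    return dict(aliases)


# Hypothetical sample in the assumed layout, not real DBLP data.
sample_xml = io.BytesIO(
    b"<dblp>"
    b'<article key="journals/x/Smith20">'
    b'<author orcid="0000-0000-0000-0001">Alice Smith</author>'
    b"<author>Bob Jones</author><title>T1</title></article>"
    b'<inproceedings key="conf/y/Smith21">'
    b'<author orcid="0000-0000-0000-0001">A. Smith</author>'
    b"<title>T2</title></inproceedings>"
    b"</dblp>"
)
result = aliases_by_orcid(sample_xml)
```

Only the aggregated dictionary grows with the input; the XML tree itself never holds more than one record at a time.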
The image jfloff/dblp-orcids is already available through Docker Hub, but you can always clone this repo and build it yourself.
docker build --rm -t jfloff/dblp-orcids .
Here is a snippet to load the CSV files into pandas:
from ast import literal_eval
import pandas as pd

orcid_info = pd.read_csv(
    'by_orcid.csv', comment='#', encoding='utf-8',
    dtype={
        # force dtypes: pandas has trouble guessing these types
        'acm_id': object,
        'scopus_id': object,
    },
    converters={
        # parse the lists of aliases and dblp_keys into Python objects
        'alias': literal_eval,
        'dblp_key': literal_eval,
    },
    # optional: set the index to orcid
    index_col='orcid')

alias_info = pd.read_csv(
    'by_alias.csv', comment='#', encoding='utf-8',
    dtype={
        # force dtypes: pandas has trouble guessing these types
        'acm_id': object,
        'scopus_id': object,
    },
    converters={
        # parse the lists of orcids (there might be multiple per alias)
        'orcid': literal_eval,
    },
    # optional: set the index to alias
    index_col='alias')
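For a quick look at the data without pandas, the same parsing can be sketched with the standard library. This is only an illustration under assumptions taken from the snippet above: comment lines start with '#', and the alias and dblp_key cells hold Python-style list literals. The read_by_orcid helper, the column order, and the sample row are all hypothetical.

```python
import csv
import io
from ast import literal_eval

# Columns assumed to hold list literals, as in the pandas converters above.
LIST_COLUMNS = ("alias", "dblp_key")


def read_by_orcid(lines):
    """Return {orcid: row dict}, skipping '#' comment lines and decoding list cells."""
    data_lines = (line for line in lines if not line.startswith("#"))
    rows = {}
    for row in csv.DictReader(data_lines):
        for column in LIST_COLUMNS:
            if row.get(column):
                row[column] = literal_eval(row[column])
        rows[row["orcid"]] = row
    return rows


# Made-up sample in the assumed layout; a real file would be opened with
# open('by_orcid.csv', encoding='utf-8', newline='') instead.
sample = io.StringIO(
    "# hypothetical sample row\n"
    "orcid,alias,dblp_key,acm_id,scopus_id\n"
    "0000-0000-0000-0001,\"['Alice Smith', 'A. Smith']\","
    "\"['homepages/s/AliceSmith']\",,\n"
)
rows = read_by_orcid(sample)
```

The identifier columns stay as plain strings here, which matches the intent of forcing them to object dtype in the pandas version.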
The code in this repository, unless otherwise noted, is MIT licensed. See the LICENSE file in this repository. When using this repository do not forget to acknowledge DBLP.