
Exploring the Data Citation Corpus

DataCite Citation Corpus

The corpus is intended as a trusted central aggregate of all data citations, to further our understanding of data usage and advance meaningful data metrics.

The data is available by request; this repository documents my explorations of it.

Data extraction

to_sql.php parses the JSON data and outputs SQL statements so we can construct a simple SQL database to explore the data.
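
As a rough illustration of this step (not the actual to_sql.php), the Python sketch below turns decoded JSON records into INSERT statements. The column names are taken from the queries used elsewhere in this document; the shape of the real JSON dump and the sample record are assumptions.

```python
import json

# Columns assumed from the queries used elsewhere in this document.
COLUMNS = ["id", "sourceId", "repositoryId", "publisherId", "subjId"]


def sql_literal(value):
    """Render a Python value as a SQL literal, doubling single quotes."""
    if value is None:
        return "NULL"
    return "'" + str(value).replace("'", "''") + "'"


def to_insert(record, table="citation"):
    """Build one INSERT statement from a decoded JSON record (a dict)."""
    values = ", ".join(sql_literal(record.get(col)) for col in COLUMNS)
    return f"INSERT INTO {table} ({', '.join(COLUMNS)}) VALUES ({values});"


# Hypothetical record; the real dump may be shaped differently.
sample = json.loads(
    '[{"id": "c1", "sourceId": "s1", "repositoryId": "r1", "subjId": "1abc"}]'
)
statements = [to_insert(record) for record in sample]
```

Missing fields become NULL, so records that lack, say, a publisherId still load.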


The data comes from two sources, DataCite and the Chan Zuckerberg Initiative (CZI):

SELECT COUNT(id), sourceId FROM citation GROUP BY sourceId;


The data repositories that are cited are identified by local UUIDs, so it is non-trivial to figure out which repository is which. Here is a list so far:

| Repository | id | citation count |
|---|---|---|
| GenBank | 00363b65-f3ef-4fa9-8255-23ab269f4930 | 3755354 |
| PDB | 87646104-e5ef-494b-b2f3-a46c9572e003 | 1729783 |
| SNP | 6087b2e9-ecbf-4898-8047-5f484f1bce2f | 890431 |


Publishers of the journals in which citations were found are also identified using UUIDs. The top twenty are listed below; names can be determined by comparison with the chart on the corpus dashboard.

| name | citation count | publisherId |
|---|---|---|
| | 2136164 | e566bc45-b8bc-430c-ab2c-9c224e1c6f21 |
| | 1029617 | ec75ceb1-215c-4376-aa1c-4b39d15dc069 |
| | 938870 | 9ead11e4-bd7d-4c91-aff0-cb962676520a |
| | 768385 | bf7ba43c-7a3e-43e3-a9c2-6ed5b6fb6303 |
| | 704427 | 08d58a61-189f-4316-892b-908a1832603d |
| | 635059 | babceab8-4440-4c65-ad12-24784190dbae |
| | 315654 | 602471f4-3d02-45f7-9d59-661471761299 |
| | 312135 | af7d8efb-1a44-4a02-9d5b-29ceb6878117 |
| | 276263 | 37fa820b-d158-43b4-8f67-e0c2f7364d35 |
| | 199813 | 55506166-9f8d-4685-967d-c71c7af956b7 |
| | 171526 | 21c1aa14-7ac4-4ccb-8fdc-8f7e3ab047a9 |
| | 147908 | 2189510e-6e8f-410c-bf2a-a92319d51b0e |
| | 114627 | faca9ac2-2c88-4277-acdd-0a1177c10094 |
| | 98882 | deba021e-5d63-48af-82b5-673c6507a03e |
| | 97239 | dba2ef73-893b-4c93-9123-ea3429d6c983 |
| | 92100 | cfd487dd-9342-49ec-b93a-a044da079368 |
| | 90016 | bd7beb5b-5e4d-4c9f-b99d-944bc8cd5bf3 |
| Pensoft | 80907 | 9d72fbd4-0a14-4ee8-bac5-75ec06ababf7 |
| | 80376 | c6e65534-0e8c-495f-99ed-04ee78761d3c |
| | 60503 | d2c56596-551e-4f1e-81e6-d7bafe1670f8 |
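
A ranking like the one above could be produced with a grouped, ordered query. The sketch below runs such a query against an in-memory SQLite database from Python; the rows are made up, and the original exploration may have used a different database engine or query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citation (id TEXT PRIMARY KEY, publisherId TEXT)")
# Made-up rows standing in for the real corpus.
conn.executemany(
    "INSERT INTO citation VALUES (?, ?)",
    [
        ("c1", "pub-a"), ("c2", "pub-a"), ("c3", "pub-b"),
        ("c4", "pub-a"), ("c5", "pub-c"), ("c6", "pub-c"),
    ],
)

# Count citations per publisher and keep the twenty largest groups.
top_publishers = conn.execute(
    """
    SELECT COUNT(id) AS n, publisherId
    FROM citation
    GROUP BY publisherId
    ORDER BY n DESC
    LIMIT 20
    """
).fetchall()
```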

Protein Data Bank

The Protein Data Bank (PDB) has 1,729,783 citations in the corpus. There are 177,220 distinct PDB identifiers cited.

SELECT DISTINCT subjId
FROM citation 
WHERE repositoryId = '87646104-e5ef-494b-b2f3-a46c9572e003';

Running these through pub_clean.php shows that 31,612 (18%) do not match the PDB pattern /^[0-9][A-Za-z0-9]{3}$/.
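
The pattern check itself is easy to reproduce. The following Python sketch (an illustration, not the contents of pub_clean.php) splits a list of candidate identifiers into those that match the pattern quoted above and those that do not; the sample identifiers are invented.

```python
import re

# Pattern quoted in the text: one digit followed by three alphanumerics.
PDB_PATTERN = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def split_by_pattern(identifiers):
    """Return (matching, non_matching) lists, preserving input order."""
    matching, non_matching = [], []
    for identifier in identifiers:
        if PDB_PATTERN.match(identifier):
            matching.append(identifier)
        else:
            non_matching.append(identifier)
    return matching, non_matching


# Invented examples: two well-formed, three malformed.
good, bad = split_by_pattern(["1abc", "4HHB", "abcd", "12345", "1ab"])
```

Note that matching the pattern only shows that a string is shaped like a PDB identifier, not that the identifier exists.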

I downloaded a list of all PDB identifiers from https://files.wwpdb.org/pub/pdb/holdings/current_file_holdings.json.gz, and then loaded those identifiers into the table identifier.
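
A minimal sketch of this loading step in Python with SQLite follows. It assumes the holdings file decodes to a JSON object whose keys are PDB identifiers, and that the identifier table has just the id and namespace columns seen in the queries; both are assumptions, as are the function names. Identifiers are upper-cased so they can be compared with UPPER(citation.subjId).

```python
import gzip
import json
import sqlite3


def read_holdings_ids(gz_bytes):
    """Decompress the holdings file and return its keys, upper-cased."""
    holdings = json.loads(gzip.decompress(gz_bytes))  # assumed: {"<PDB id>": ...}
    return sorted(pdb_id.upper() for pdb_id in holdings)


def load_identifiers(conn, ids, namespace="pdb"):
    """Insert identifiers, skipping any already present (SQLite syntax)."""
    conn.executemany(
        "INSERT OR IGNORE INTO identifier (id, namespace) VALUES (?, ?)",
        [(identifier, namespace) for identifier in ids],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE identifier (id TEXT, namespace TEXT, PRIMARY KEY (id, namespace))"
)
# Stand-in for the downloaded file, built in memory for the example.
fake_file = gzip.compress(json.dumps({"100d": {}, "4hhb": {}}).encode("utf-8"))
load_identifiers(conn, read_holdings_ids(fake_file))
pdb_count = conn.execute(
    "SELECT COUNT(id) FROM identifier WHERE namespace='pdb'"
).fetchone()[0]
```

For the real file, the bytes would come from opening the downloaded current_file_holdings.json.gz in binary mode and reading it.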

SELECT COUNT(id) FROM identifier WHERE namespace='pdb'; 

There are 216,225 distinct PDB identifiers.

We can compare the PDB identifiers in the corpus with the actual PDB identifiers:

SELECT COUNT(citation.id) FROM citation 
INNER JOIN identifier 
ON UPPER(citation.subjId) = identifier.id
WHERE repositoryId = '87646104-e5ef-494b-b2f3-a46c9572e003' AND identifier.namespace = 'pdb';

This matches 1,233,993 citations, which is 71% of the 1,729,783 PDB citations in the corpus. Hence a little under a third (29%) of the PDB citations appear to be erroneous.

We can look at some mistaken identifiers:

SELECT citation.id, UPPER(citation.subjId), identifier.id 
FROM citation 
LEFT JOIN identifier ON UPPER(citation.subjId) = identifier.id 
WHERE repositoryId = '87646104-e5ef-494b-b2f3-a46c9572e003' 
AND citation.subjId LIKE '1%'
AND identifier.id IS NULL
LIMIT 100;
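
The two queries above (an inner join to count matches, and a left join keeping only rows with no match) can be tried on a toy database. The Python sketch below uses SQLite with invented rows and a placeholder repository id, 'pdb-repo'. One deliberate difference, flagged as my own choice: the namespace test is placed in the ON clause of the left join, so that identifiers from other namespaces cannot mask a missing PDB identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE citation (id TEXT PRIMARY KEY, repositoryId TEXT, subjId TEXT);
    CREATE TABLE identifier (id TEXT, namespace TEXT);
    INSERT INTO citation VALUES
        ('c1', 'pdb-repo', '1abc'),
        ('c2', 'pdb-repo', '1ABC'),
        ('c3', 'pdb-repo', '1zzz'),
        ('c4', 'other',    '1abc');
    INSERT INTO identifier VALUES ('1ABC', 'pdb');
    """
)

# Citations whose upper-cased identifier is a known PDB identifier.
matched = conn.execute(
    """
    SELECT COUNT(citation.id) FROM citation
    INNER JOIN identifier ON UPPER(citation.subjId) = identifier.id
    WHERE repositoryId = 'pdb-repo' AND identifier.namespace = 'pdb'
    """
).fetchone()[0]

# Anti-join: citations with no corresponding PDB identifier.
unmatched = conn.execute(
    """
    SELECT citation.id, UPPER(citation.subjId) FROM citation
    LEFT JOIN identifier
      ON UPPER(citation.subjId) = identifier.id AND identifier.namespace = 'pdb'
    WHERE repositoryId = 'pdb-repo' AND identifier.id IS NULL
    """
).fetchall()
```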

Repository identifiers


| Label | Methodology | identifiers.org |
|---|---|---|
| arrayexpress | https://identifiers.org/arrayexpress:dataset | Y |
| biomodels | https://identifiers.org/biomodels.db:dataset | Y |
| bioproject | https://identifiers.org/bioproject:dataset | Y |
| biosample | https://identifiers.org/biosample:dataset | Y |
| biostudies | https://identifiers.org/biostudies:dataset | Y |
| cath | https://identifiers.org/cath:dataset | Y |
| chebi | https://identifiers.org/chebi:dataset[6:] | Y |
| chembl | https://identifiers.org/chembl:dataset | Y |
| complexportal | https://identifiers.org/complexportal:dataset | Y |
| dbgap | https://identifiers.org/dbgap:dataset | Y |
| doi | https://dx.doi.org/:dataset | sometimes |
| ebisc | https://cells.ebisc.org/dataset | N |
| efo | https://identifiers.org/efo:dataset | Y |
| ega | https://identifiers.org/ega.dataset:dataset | Y |
| emdb | https://identifiers.org/emdb:dataset | Y |
| empiar | https://identifiers.org/empiar:dataset | Y |
| ensembl | https://identifiers.org/ensembl:dataset | Y |
| gca | https://identifiers.org/insdc.gca:dataset | Y |
| gen | https://identifiers.org/ena.embl:dataset | Y |
| geo | https://identifiers.org/geo:dataset | Y |
| gisaid | http://gisaid.org/EPI/dataset | N |
| go | https://identifiers.org/go:dataset | Y |
| hgnc | https://identifiers.org/hgnc:dataset | Y |
| hipsci | http://www.hipsci.org/lines/#/lines/dataset | N |
| hpa | https://identifiers.org/hpa:dataset | Y |
| igsr | https://identifiers.org/coriell:dataset | Y |
| intact | https://identifiers.org/intact:dataset | Y |
| interpro | https://identifiers.org/interpro:dataset | Y |
| metabolights | https://identifiers.org/metabolights:dataset | Y |
| metagenomics | https://identifiers.org/mgnify.samp:dataset | Y |
| mint | https://identifiers.org/mint:dataset | Y |
| omim | https://identifiers.org/mim:dataset | Y |
| orphadata | https://identifiers.org/orphanet:dataset | Y |
| pdb | https://identifiers.org/pdb:dataset | Y |
| pfam | https://identifiers.org/pfam:dataset | Y |
| pxd | https://identifiers.org/pride:dataset | Y |
| reactome | https://identifiers.org/reactome:dataset | Y |
| refseq | https://identifiers.org/refseq:dataset | Y |
| refsnp | https://identifiers.org/dbsnp:dataset | Y |
| rfam | https://identifiers.org/rfam:dataset | Y |
| rnacentral | https://identifiers.org/rnacentral:dataset | Y |
| rrid | https://identifiers.org/rrid:dataset | Y |
| treefam | https://identifiers.org/treefam:dataset | Y |
| uniparc | https://identifiers.org/uniparc:dataset | Y |
| uniprot | https://identifiers.org/uniprot:dataset | Y |
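
To make the table concrete, here is a small Python sketch that expands a template into a URL. It rests on two assumptions of mine: that the trailing `dataset` in each template is a placeholder for the cited identifier, and that `[6:]` is a Python-style slice dropping the first six characters (for example a "CHEBI:" prefix). Only a few rows are copied in, and the identifiers used are illustrative.

```python
# A few rows copied from the table above.
TEMPLATES = {
    "pdb": "https://identifiers.org/pdb:dataset",
    "chebi": "https://identifiers.org/chebi:dataset[6:]",
    "ega": "https://identifiers.org/ega.dataset:dataset",
    "ebisc": "https://cells.ebisc.org/dataset",
}

PLACEHOLDER = "dataset"
SLICED_PLACEHOLDER = "dataset[6:]"


def expand(label, identifier):
    """Substitute the identifier for the trailing placeholder only."""
    template = TEMPLATES[label]
    if template.endswith(SLICED_PLACEHOLDER):
        # Assumed meaning of [6:]: drop the first six characters.
        return template[: -len(SLICED_PLACEHOLDER)] + identifier[6:]
    if template.endswith(PLACEHOLDER):
        return template[: -len(PLACEHOLDER)] + identifier
    raise ValueError(f"no placeholder at the end of template for {label!r}")


pdb_url = expand("pdb", "1ABC")
chebi_url = expand("chebi", "CHEBI:15377")
ega_url = expand("ega", "EGAD00000000001")
```

Only the trailing placeholder is replaced because the ega template also contains "dataset" inside its namespace prefix (ega.dataset), so a global string replacement would corrupt it.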