internetarchive/fatcat

non-lowercase DOIs

bnewbold opened this issue · 1 comments

Fatcat has a general policy that DOIs should always be normalized and stored in lower-case. It turns out this has not actually been enforced at the API level, and the clean_doi() helper function in Python was not normalizing to lower-case. As a result, many release entities have been created with non-lower-case DOIs, and many of those are likely duplicates.

The usual DOI importers (Crossref, Datacite) did lower-case, and the lookup API also lower-cases, which has limited the scope of the problem, but there are still on the order of 140k release entities with non-lower-case DOIs, many of them likely duplicates:

zcat release_extid.tsv.gz | cut -f3 | rg '[A-Z]' | pv -l | wc -l
139964
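
For reference, a rough Python equivalent of the count above, as a sketch only. It assumes the DOI sits in the third tab-separated column of the dump (which is what `cut -f3` implies); the function names are made up for illustration and are not part of fatcat.

```python
import gzip
import re
from typing import Iterable

# Same pattern as the rg invocation: any ASCII upper-case letter.
UPPER = re.compile(r"[A-Z]")


def count_uppercase_dois(lines: Iterable[str], column: int = 2) -> int:
    """Count rows whose DOI column contains an upper-case ASCII letter.

    `column` is zero-based, so 2 corresponds to `cut -f3`.
    Rows with fewer fields than expected are skipped.
    """
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column and UPPER.search(fields[column]):
            count += 1
    return count


def count_in_dump(path: str) -> int:
    """Run the count over a gzipped TSV dump (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_uppercase_dois(handle)
```

Calling `count_in_dump("release_extid.tsv.gz")` should then report the same kind of figure as the shell pipeline, assuming the column layout above.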

Here is an example of two release entities for the same work. The Pubmed-sourced import happened first, and resulted in a release with an upper-case DOI. The Crossref import happened second (same day!) with a lower-case DOI:

Fixing this could include multiple stages:

  • fix clean_doi() in Python to lower-case DOIs
  • have the API enforce lower-casing, at least at creation time (e.g., reject new entities whose DOI is not lower-case, but don't clobber existing records)
  • update and/or merge existing entities
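
A minimal sketch of what the first two stages could look like, written in Python for illustration. This is not the actual clean_doi() implementation or the API daemon's validation code; `normalize_doi` and `check_doi_for_creation` are hypothetical names, and the only behavior assumed is "strip whitespace and lower-case" on the client side and "reject anything not already lower-case" at creation time.

```python
from typing import Optional


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Client-side normalization: trim whitespace and lower-case the DOI.

    Returns None for missing or empty input. Any further validation
    (prefix checks, etc.) is deliberately left out of this sketch.
    """
    if raw is None:
        return None
    doi = raw.strip().lower()
    return doi or None


def check_doi_for_creation(doi: str) -> None:
    """Creation-time check: refuse DOIs that are not already lower-case.

    Raising (instead of silently lower-casing) keeps the check strict for
    new entities while leaving existing records untouched.
    """
    if doi != doi.lower():
        raise ValueError(f"DOI must be lower-case: {doi!r}")
```

With this split, importers call the normalizing helper before submitting, and the strict check only catches clients that skipped normalization.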

All non-lower-case DOIs in the current fatcat catalog have now been updated to be lower-case. This impacted about 140k release entities.

One part of the cleanup will be the many duplicate DOIs that this update introduced, but that can be handled as part of generic DOI de-duplication.
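
As a rough illustration of the de-duplication step (a sketch, not the actual cleanup tooling), candidate duplicates can be found by grouping release identifiers on the lower-cased DOI. The `(release_id, doi)` input shape and the function name are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_dois(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group release identifiers by lower-cased DOI.

    `rows` yields (release_id, doi) pairs. Only DOIs that map to more
    than one release are returned, as candidates for merging.
    """
    by_doi: Dict[str, List[str]] = defaultdict(list)
    for release_id, doi in rows:
        by_doi[doi.lower()].append(release_id)
    return {doi: ids for doi, ids in by_doi.items() if len(ids) > 1}
```

Deciding which release in each group to keep or how to merge them is a separate question that this sketch does not try to answer.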

A remaining task is to strictly enforce DOI lower-casing in the fatcat API daemon.