PathwayCommons/factoid

Improve process of identifying and updating article using the author-provided information (e.g. title)


This is part documentation, part update.

How Biofactoid matches articles to author-provided information

Authors start using Biofactoid by entering their paper title, but we also accept a PubMed identifier (PMID) or a Digital Object Identifier (DOI). The goal is to identify the paper, which in practice means retrieving a matching record from an index, in this case PubMed or Crossref. For a PMID or DOI, the process is trivial. For a title, the process is more complex, but can be summarized in the following table:

*Match item in PubMed | Match preprint in Crossref | Interpretation
Yes                   | No                         | Publication
No                    | Yes                        | Preprint
Yes                   | Yes                        | **Preprint
No                    | No                         | Ambiguous

*Match: Author-provided information is (1) a substring of the retrieved article title or (2) equal to the article DOI or PMID
** If the DOIs are equal, use the PubMed record; otherwise, use the most recent record. Example: a bioRxiv preprint indexed in PubMed is forwarded to eLife as a reviewed preprint. We want the latter (the eLife reviewed preprint).
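
The matching rule and the table can be sketched in Python as follows. This is an illustration only, not Biofactoid's actual code: the `Record` type, the `is_match` and `interpret` functions, and the `published` field used to decide "most recent" are all hypothetical. It also assumes that matching ignores case and extra whitespace, that a match in PubMed only means "Publication", a match in Crossref only means "Preprint", and no match at all means "Ambiguous".

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class Record:
    """Hypothetical, simplified article record from PubMed or Crossref."""
    title: str
    doi: Optional[str] = None
    pmid: Optional[str] = None
    published: Optional[date] = None  # assumed basis for "most recent"


def _norm(text: str) -> str:
    # Assumption: comparisons ignore case and runs of whitespace.
    return " ".join(text.lower().split())


def is_match(paper_id: str, record: Optional[Record]) -> bool:
    """True if paper_id is a substring of the title or equals the DOI or PMID."""
    if record is None:
        return False
    query = _norm(paper_id)
    if not query:
        return False
    ids = {_norm(i) for i in (record.doi, record.pmid) if i}
    return query in ids or query in _norm(record.title)


def interpret(
    paper_id: str, pubmed: Optional[Record], crossref: Optional[Record]
) -> Tuple[str, Optional[Record]]:
    """Return (interpretation, chosen record) following the table above."""
    in_pubmed = is_match(paper_id, pubmed)
    in_crossref = is_match(paper_id, crossref)
    if in_pubmed and in_crossref:
        same_doi = (
            pubmed.doi is not None
            and crossref.doi is not None
            and _norm(pubmed.doi) == _norm(crossref.doi)
        )
        if same_doi:
            return "preprint", pubmed  # equal DOIs: use PubMed
        # Different DOIs: use the most recent record (ties go to PubMed).
        newest = max((pubmed, crossref), key=lambda r: r.published or date.min)
        return "preprint", newest
    if in_pubmed:
        return "publication", pubmed
    if in_crossref:
        return "preprint", crossref
    return "ambiguous", None
```

In the bioRxiv/eLife example, both indexes match the title but the DOIs differ, so `interpret` returns the record with the later `published` date.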

When author articles cannot be matched

An author's paper may fail to match for trivial reasons (incorrect information provided, spam), but this is rare. One important case is an accepted manuscript that has yet to be published; depending on the journal, the delay can be on the order of months (see #1280).

CRON: Trying again

Our CRON currently runs once a week and is tasked with updating article information. The update works nearly identically to the process described above for finding an author's paper.

Minor updates

  • CRON should use the author-provided paper ID (i.e. title) to update the Document article metadata
    • Currently it uses the PMID or DOI if they exist, effectively freezing the object
    • The trade-off is that an article could 'update' from preprint to publication, but I think this is the most desired behaviour
  • Network errors should fall back to existing article metadata
    • Currently, there's a bug: when an existing article is being updated and a network error occurs, the existing metadata ends up deleted
    • Should keep the existing information
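
A minimal sketch of the two updates above, written in Python for illustration rather than taken from the Biofactoid codebase. The `refresh_article` function, the `paper_id` and `article` keys, and the injected `find_article` lookup are all hypothetical names. Keeping the existing metadata when the lookup returns no match is an extra assumption; the list above only specifies the network-error case.

```python
from typing import Callable, Optional


def refresh_article(
    doc: dict, find_article: Callable[[str], Optional[dict]]
) -> dict:
    """Return a copy of doc with refreshed article metadata.

    The lookup always uses the author-provided paper ID (e.g. the title),
    not a previously stored PMID or DOI, so a preprint can later be
    replaced by its publication.
    """
    try:
        found = find_article(doc["paper_id"])
    except (ConnectionError, TimeoutError):
        # Network error: fall back to the existing metadata.
        return doc
    if found is None:
        # Assumption: an unmatched lookup also leaves the metadata as-is.
        return doc
    return {**doc, "article": found}
```

Passing the lookup in as a parameter keeps the sketch independent of any particular PubMed or Crossref client and makes the failure path easy to exercise with a stub that raises `ConnectionError`.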

Refs #1211, #1201