Improve process of identifying and updating article using the author-provided information (e.g. title)
Closed this issue · 0 comments
This is part documentation, part update.
How Biofactoid matches articles to author-provided information
Authors start using Biofactoid by entering their paper title, but we also accept a PubMed identifier (PMID) and Digital Object Identifier (DOI). The goal is to identify the paper, which, in practice, means retrieving a matching record from an index, in this case PubMed or Crossref. For a PMID or DOI, the process is trivial. For a title, the process is more complex, but can be summarized in the following table:
*Match item in PubMed | Match preprint in Crossref | Interpretation |
---|---|---|
✓ | ⛌ | Publication |
⛌ | ✓ | Preprint |
✓ | ✓ | **Preprint |
⛌ | ⛌ | Ambiguous |
*Match: The author-provided information is (1) a substring of the retrieved article title or (2) equal to the article DOI or PMID.
** If the DOIs are equal, use the PubMed record; otherwise, use the most recent record. Example: a bioRxiv preprint in PubMed that was forwarded to eLife as a reviewed preprint. We desire the latter.
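The table and its two footnotes can be sketched as a small decision function. This is only an illustration under stated assumptions, not Biofactoid's actual code: the `Record` type, the `is_match` and `choose` names, the case-insensitive comparison, and the use of a publication date to decide "most recent" are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class Record:
    """Hypothetical, simplified index record (not Biofactoid's schema)."""
    title: str
    published: date
    doi: Optional[str] = None
    pmid: Optional[str] = None


def is_match(query: str, record: Record) -> bool:
    """Footnote *: substring of the title, or equal to the DOI or PMID.

    Case-insensitive comparison is an assumption made for this sketch.
    """
    q = query.strip().lower()
    if not q:
        return False  # an empty string would be a substring of every title
    ids = [i.lower() for i in (record.doi, record.pmid) if i]
    return q in record.title.lower() or q in ids


def choose(
    query: str, pubmed: Optional[Record], crossref: Optional[Record]
) -> Tuple[str, Optional[Record]]:
    """Apply the table: returns (interpretation, chosen record)."""
    pm = pubmed if pubmed and is_match(query, pubmed) else None
    cr = crossref if crossref and is_match(query, crossref) else None
    if pm and cr:
        # Footnote **: equal DOIs -> PubMed record; otherwise the most recent.
        same_doi = bool(pm.doi and cr.doi and pm.doi.lower() == cr.doi.lower())
        chosen = pm if same_doi else max((pm, cr), key=lambda r: r.published)
        return "Preprint", chosen
    if pm:
        return "Publication", pm
    if cr:
        return "Preprint", cr
    return "Ambiguous", None
```

When both indexes match but the DOIs differ, the later-dated record wins, which corresponds to the eLife reviewed-preprint example in the footnote.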
When author articles cannot be matched
An author's paper may fail to match for trivial reasons (incorrect information provided, spam), but this is rare. One important case is an accepted manuscript that is yet to be published; depending on the journal, the delay can be on the order of months (see #1280).
CRON: Trying again
Our CRON currently runs once a week and is tasked with updating article information. The update works nearly identically to the process described above for finding an author's paper.
Minor updates
- CRON should use the author-provided paper ID (i.e. title) to update the Document article metadata
  - Currently it uses the PMID or DOI if they exist, effectively freezing the object
  - The trade-off is that an article could 'update' from preprint to publication, but I think this is the most desired behaviour
- Network errors should fall back to existing article metadata
  - Currently, there's a bug: when an existing article is being updated and a network error occurs, the existing metadata ends up deleted
  - Should keep the existing information
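
The fallback behaviour in the second item could look roughly like the sketch below. It is written in Python for illustration only; `refresh_article` and the `find` callback are hypothetical names, network failures are assumed to surface as `OSError` (which covers `ConnectionError` and `TimeoutError`), and keeping the existing metadata when no match is found is an extra assumption beyond what is described above.

```python
from typing import Callable, Optional


def refresh_article(
    paper_id: str,
    existing: Optional[dict],
    find: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Re-run the lookup with the author-provided paper ID.

    `find` is a hypothetical lookup that returns article metadata,
    returns None when nothing matches, and raises OSError on network failure.
    """
    try:
        found = find(paper_id)
    except OSError:
        # Network error: keep whatever metadata we already have.
        return existing
    # Assumption: a lookup that finds nothing also keeps the existing metadata.
    return found if found is not None else existing
```

Because the lookup is keyed on the author-provided ID rather than a stored PMID or DOI, a successful run can replace preprint metadata with the published record, while a failed run leaves the document unchanged.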