Consider flexibility in rules for matching an article by title
Opened this issue · 3 comments
Background
Article matching has been iterated on many times for different edge cases (#1074, #1124, #848), and there are services aimed at resolving this information, e.g. #1295. From my observations, this works pretty well, but there are cases where no article is matched due to ambiguity.
Currently, an author's input title must be an exact subset of the record retrieved from either PubMed or CrossRef after 'sanitization':
- trimming: const trimmed = _.trim( raw , ' .')
- lower casing: const lower = _.toLower( trimmed )
- removal of non-words: const clean = lower.replace(/[\W_]+/g, ' ')
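For reference, the three steps above can be sketched without lodash. This is a rough approximation, not the project's actual code: `sanitize` and `isTitleMatch` are hypothetical names, and reading "exact subset" as a substring test is an assumption.

```javascript
// Dependency-free approximation of the three sanitization steps.
// `sanitize` and `isTitleMatch` are hypothetical names, not the project's.
function sanitize(raw) {
  // Strip spaces and periods from both ends (approximates _.trim(raw, ' .'))
  const trimmed = raw.replace(/^[ .]+|[ .]+$/g, '');
  const lower = trimmed.toLowerCase();
  // Replace each run of non-word characters or underscores with one space
  return lower.replace(/[\W_]+/g, ' ');
}

function isTitleMatch(inputTitle, recordTitle) {
  // Assumption: "exact subset" means the sanitized input is a substring
  // of the sanitized record title
  return sanitize(recordTitle).includes(sanitize(inputTitle));
}
```

Under this reading, a single extra word in the input (e.g. "prolongs the presence" vs "prolongs presence") is enough to make the check fail.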
Problems observed
There remain cases where we might want to reasonably relax conditions. For example:
- Stop words
- Input: "Syntaxin-6 delays prion protein fibril formation and prolongs the presence of toxic aggregation intermediates"
- Actual: "Syntaxin-6 delays prion protein fibril formation and prolongs presence of toxic aggregation intermediates"
- Source: https://doi.org/10.7554/eLife.83320
- Formatting
- whitespace 1
- Input: "Senescent cells inhibit mouse myoblast differentiation via the Senescence Associated Secretory Phenotype ( SASP)-lipid 15d-PGJ2 -mediated modification and control of HRas"
- Actual (PubMed): "Senescent cells inhibit mouse myoblast differentiation via the SASP-lipid 15d-PGJ2 mediated modification and control of HRas."
- Source: https://pubmed.ncbi.nlm.nih.gov/39196610/
- whitespace 2
- Input: "Circular RNA HMGCS1 sponges MIR4521 to aggravate type 2 diabetes-induced vascular endothelial dysfunction"
- Actual (eLife): "Circular RNA HMGCS1 sponges miR-4521 to aggravate type 2 diabetes-induced vascular endothelial dysfunction"
- Source: https://doi.org/10.7554/eLife.97267.1
- whitespace 3
- Input: "Circular RNA HMGCS1 sponges miR-4521 to aggravate type 2 diabetes-induced vascular endothelial dysfunction"
- Actual (update): "Circular RNA HMGCS1 sponges MIR4521 to aggravate type 2 diabetes-induced vascular endothelial dysfunction."
- Source: https://doi.org/10.7554/eLife.97267
- Author-specified info
- journal, year
- Input: "eLife 2024: Defining cell type-specific immune responses in a mouse model of allergic contact dermatitis by single-cell transcriptomics"
- Actual: "Defining cell type-specific immune responses in a mouse model of allergic contact dermatitis by single-cell transcriptomics"
- Source: https://doi.org/10.7554/eLife.94698.3
- Markup
- Input: "Trans regulation of an odorant binding protein by a proto-Y chromosome affects male courtship in house fly"
- Actual: "Transregulation of an odorant binding protein by a proto-Y chromosome affects male courtship in house fly"
- Source: https://www.biorxiv.org/content/10.1101/2021.06.22.447776v2
- Partial-match
- title
- Input: "Root-specific theanine metabolism and regulation at the single-cell level in tea plants (Camellia sinensis)"
- Actual: "Root-specific secondary metabolism at the single-cell level: a case study of theanine metabolism and regulation in the roots of tea plants (Camellia sinensis)"
- Source: https://doi.org/10.7554/eLife.95891.2
- Characters
- Dashes
- Input: "Neurons enhance blood-–brain barrier function via upregulating claudin-5 and VE-cadherin expression due to glial cell line-derived neurotrophic factor secretion"
- Actual: "Neurons enhance blood-brain barrier function via upregulating claudin-5 and VE-cadherin expression due to GDNF secretion"
- Source: https://elifesciences.org/reviewed-preprints/96161
Details
There are potential pitfalls to increasing flexibility. Notably, the title of a manuscript can change between preprints, revised versions and the final version of record.
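One possible way to relax matching while keeping a bound on how flexible it gets is a token-overlap score with a threshold. The sketch below is only an illustration: the stop-word list, the choice of the Dice coefficient, the 0.9 threshold and all function names are assumptions, not anything in the current code.

```javascript
// Hypothetical stop-word list; a real one would need curating
const STOP_WORDS = new Set(['a', 'an', 'and', 'of', 'the', 'to', 'in', 'by', 'via']);

// Lower-case, split on non-word characters, drop empties and stop words
function tokens(title) {
  return title
    .toLowerCase()
    .split(/[\W_]+/)
    .filter(t => t.length > 0 && !STOP_WORDS.has(t));
}

// Dice coefficient over the two token sets: 2|A∩B| / (|A| + |B|)
function titleSimilarity(a, b) {
  const setA = new Set(tokens(a));
  const setB = new Set(tokens(b));
  if (setA.size === 0 || setB.size === 0) return 0;
  let shared = 0;
  for (const t of setA) if (setB.has(t)) shared += 1;
  return (2 * shared) / (setA.size + setB.size);
}

// Hypothetical threshold; it would need tuning against collected cases
const isLooseMatch = (a, b, threshold = 0.9) => titleSimilarity(a, b) >= threshold;
```

With these assumptions the stop-word case (Syntaxin-6) scores 1, while the partial-match case (tea plants) falls below the 0.9 threshold, so the threshold is the knob that decides how much title drift between versions is tolerated.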
Tasks
- Collect additional cases of real/potential mismatches
- Create test harness
- Pull out common code for matching
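A test harness for these tasks could start as a small table of cases taken from this issue, run against any candidate matcher. The sketch below is hypothetical: `CASES`, `runCases` and the `expected` flags are illustrative names, and marking every case as a desired match is an assumption.

```javascript
// Cases copied from this issue. `expected: true` (a match is desired)
// is an assumption about the intended outcome.
const CASES = [
  {
    label: 'stop words',
    input: 'Syntaxin-6 delays prion protein fibril formation and prolongs the presence of toxic aggregation intermediates',
    actual: 'Syntaxin-6 delays prion protein fibril formation and prolongs presence of toxic aggregation intermediates',
    expected: true,
  },
  {
    label: 'journal, year',
    input: 'eLife 2024: Defining cell type-specific immune responses in a mouse model of allergic contact dermatitis by single-cell transcriptomics',
    actual: 'Defining cell type-specific immune responses in a mouse model of allergic contact dermatitis by single-cell transcriptomics',
    expected: true,
  },
  {
    label: 'apostrophe',
    input: 'Saccharomyces cerevisiae Rev7 promotes non-homologous end-joining by blocking Mre11 nuclease and Rad50’s ATPase activities and homologous recombination',
    actual: "Saccharomyces cerevisiae Rev7 promotes non-homologous end-joining by blocking Mre11 nuclease and Rad50's ATPase activities and homologous recombination",
    expected: true,
  },
];

// Runs a matcher of shape (input, actual) => boolean over the cases
// and collects those whose result differs from `expected`
function runCases(matcher, cases = CASES) {
  const failures = cases.filter(c => matcher(c.input, c.actual) !== c.expected);
  return { total: cases.length, passed: cases.length - failures.length, failures };
}
```

For example, `runCases((a, b) => a === b)` reports every case as a failure, which gives a baseline to compare relaxed matchers against. Cases that should not match (to guard against being too flexible) can be added with `expected: false`.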
Not too flexible: #1299 (comment)
FYI, CrossRef etc. use a lot of the same heuristics to match titles (for the purposes of matching a VOR to its preprint)
New case:
- Punctuation
- Apostrophe
- Input: "Saccharomyces cerevisiae Rev7 promotes non-homologous end-joining by blocking Mre11 nuclease and Rad50’s ATPase activities and homologous recombination"
- Actual (PubMed): "Saccharomyces cerevisiae Rev7 promotes non-homologous end-joining by blocking Mre11 nuclease and Rad50's ATPase activities and homologous recombination"
- Sources
There are probably some npm packages for normalizing stuff, but which ones to use and how far-reaching the normalization should be is another question. So far I've avoided all of these.
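Without pulling in a package, a narrow fold of typographic apostrophes and dashes can be done with built-ins. This is a minimal sketch; `foldPunctuation` is a hypothetical name and the character ranges are a guess at what is worth covering. Note that the `[\W_]+` replacement listed in the issue already turns these characters into spaces, so a fold like this would mainly matter for any step that runs before that replacement.

```javascript
// Fold typographic punctuation variants to ASCII before comparing.
// `foldPunctuation` is a hypothetical helper, not part of the codebase.
function foldPunctuation(text) {
  return text
    .normalize('NFKC')                      // Unicode compatibility forms
    .replace(/[\u2018\u2019\u02BC]/g, "'")  // curly and modifier apostrophes
    .replace(/[\u201C\u201D]/g, '"')        // curly double quotes
    .replace(/[\u2010-\u2015\u2212]/g, '-') // hyphens, dashes, minus sign
    .replace(/-{2,}/g, '-');                // collapse doubled dashes
}
```

Applied to the cases above, this maps "Rad50’s" to "Rad50's" and "blood-–brain" to "blood-brain".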