hugheylab/pmparser

Support PubMed IDs for NCBI Bookshelf records

Closed this issue · 7 comments

We noticed some PubMed records that exist online, but aren't in PMDB:

They appear to all be PubMed IDs for a corresponding NCBI Bookshelf record.

Have you ever encountered these, and does it make sense to include the ones with PubMed IDs as PMDB records?

I've run into them in the past (see manubot/manubot#298) and will post any additional information I come across, such as how to retrieve the complete list of PMIDs for Bookshelf records. Also noting the publication that describes them:

NCBI Bookshelf: books and documents in life sciences and health care
Marilu A. Hoeppner
Nucleic Acids Research (2012-11-29) https://doi.org/ghbhpc
DOI: 10.1093/nar/gks1279 · PMID: 23203889 · PMCID: PMC3531209

The way we get everything from PubMed is by downloading and parsing all of the XML files available in the "baseline" and "updatefiles" subdirectories located at ftp.ncbi.nlm.nih.gov/pubmed/. Looking back through the code, it would seem that the only explanation for those records being absent is that they aren't contained in the XML files we parse to generate PMDB. I also browsed the FTP site for any indication of NCBI Bookshelf directories, and the only thing I could find was some sort of related directory that didn't contain any file we could parse.

Unfortunately, there is no really good way to check whether they are contained in the files aside from manually checking the contents of each XML file, and there are hundreds of files with over 1000 lines each. It would also be considerable work to incorporate external data into our DB creation.

So, TL;DR: Doesn't seem like these records are in the files we use to create the database, so unless PubMed starts including them they won't make it into PMDB.
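
For anyone who wants to make that check less manual, here is a minimal sketch (not pmparser code, which is written in R) that streams already-downloaded `.xml.gz` files and reports which of a set of PMIDs appear as records. The function names `find_pmids` and `find_pmids_in_dir` are hypothetical, and the element paths (`MedlineCitation/PMID`, `BookDocument/PMID`) are my understanding of the PubMed XML layout, so treat them as assumptions to verify against the DTD.

```python
import gzip
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed location of the record-level PMID inside each record type.
PMID_PATHS = {
    "PubmedArticle": "MedlineCitation/PMID",
    "PubmedBookArticle": "BookDocument/PMID",
}


def find_pmids(xml_gz_path, wanted):
    """Return the subset of `wanted` PMIDs that are records in one gzipped file."""
    wanted = {str(p) for p in wanted}
    found = set()
    with gzip.open(xml_gz_path, "rb") as handle:
        # Stream the file record by record instead of loading it whole.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            path = PMID_PATHS.get(elem.tag)
            if path is None:
                continue
            pmid = elem.findtext(path)
            if pmid in wanted:
                found.add(pmid)
            elem.clear()  # drop the children of the record we just handled
    return found


def find_pmids_in_dir(directory, wanted):
    """Map each PMID that was found to the names of the files containing it."""
    hits = {}
    for path in sorted(Path(directory).glob("*.xml.gz")):
        for pmid in find_pmids(path, wanted):
            hits.setdefault(pmid, []).append(path.name)
    return hits
```

Calling `find_pmids_in_dir("baseline", [...])` on a local copy of the files (the directory name here is just an example) returns a dict that is empty when none of the listed PMIDs occur as records.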

I see, probably worth contacting the help desk to inquire how we can download bookshelf records in bulk. I'm away for a week, but can get in touch with them when I'm back unless you do first.

@dhimmel So earlier this week I decided to do another look through the code and found we only parse out PubmedArticle tags and not PubmedBookArticle tags.

I then modified the code and checked the XML files for PubmedBookArticle tags but could not find any, so I emailed the NLM Help Desk. This is the response I got:

"Thank you for writing to the help desk. Book citations are not included in the FTP files. They can be retrieved from the web interface or with the PubMed E-Utilities API. "

I followed up by asking if they can be retrieved in bulk, but have yet to receive an answer.
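
The tag check described above could be sketched roughly as follows. This is an illustration in Python rather than the actual change made to pmparser, and `count_record_tags` is a hypothetical name; it simply counts how many `PubmedArticle` and `PubmedBookArticle` elements a gzipped file contains.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter

RECORD_TAGS = ("PubmedArticle", "PubmedBookArticle")


def count_record_tags(xml_gz_path):
    """Count record elements of each type in one gzipped PubMed XML file."""
    counts = Counter()
    with gzip.open(xml_gz_path, "rb") as handle:
        # "end" events fire once an element has been fully read.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag in RECORD_TAGS:
                counts[elem.tag] += 1
                elem.clear()  # release the record's children after counting
    return counts
```

A file with no book records yields a count of zero for `PubmedBookArticle`, which would be consistent with the help desk's statement that book citations are not in the FTP files.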

Thanks @JSchoenbachler for looking into this!

Even if there is no bulk download, I wonder whether there is a way to get a list of all pubmed ids for NCBI bookshelf records? Looking forward to what you hear from the helpdesk.

Hi @dhimmel, given that the data aren't in the XML, this isn't going to be a priority for us. You're welcome to investigate more and make a pull request.

@dhimmel Apologies, I forgot to put the response from NLM here. Here it is:

"Dear Colleague,

E-utilities is generally useful with being able to search for unique identifiers of records matching a search term/topic, and then using those unique identifiers to fetch the records/articles that correspond to them. For example, users can, with e-utilities, search for all PMIDs corresponding to pubmed articles containing a search term of interest using esearch. Subsequently, efetch is used to obtain the articles corresponding to each PMID.

The API searches for results when provided with the database of interest to search, so it should work for bookshelf as well. There's an extensive resource of documentation explaining use of E-utilities.

This link specifically has a section regarding how to download large batches of data: https://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.Application_3_Retrieving_large
The overall documentation (introduction, examples, etc) can be found here: https://www.ncbi.nlm.nih.gov/books/NBK25501/"
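
As a rough illustration of the esearch-then-efetch pattern the help desk describes, the sketch below builds the two request URLs and parses the ID list out of an esearch XML response. The helper names are hypothetical, the search term is left to the caller, and nothing here establishes which term (if any) returns exactly the PMIDs of Bookshelf records; network calls are deliberately left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pubmed", retstart=0, retmax=100):
    """URL for an esearch request returning one page of matching IDs."""
    params = {"db": db, "term": term, "retstart": retstart, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(ids, db="pubmed"):
    """URL for an efetch request returning the records for `ids` as XML."""
    params = {"db": db, "id": ",".join(str(i) for i in ids), "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def parse_esearch_ids(xml_text):
    """Extract the IDs listed under IdList in an esearch XML response."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.findall("./IdList/Id")]
```

Paging through a large result set would mean repeating the esearch request with increasing `retstart` values, as the linked documentation on retrieving large batches explains; NCBI's usage guidelines on request rates apply.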

Thanks @JSchoenbachler for posting the helpdesk reply. It's still unclear to me whether there is a way to search for a list of all PubMed IDs corresponding to Bookshelf records, which seems to be the critical missing piece here.

@jakejh I think GitHub has a new option for "Close as not planned" which is probably more appropriate here than "Close as completed".