ropensci/rcrossref

id_converter() not converting PMIDs correctly

Adafede opened this issue · 11 comments

Session Info
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] fr_CH.UTF-8/fr_CH.UTF-8/fr_CH.UTF-8/C/fr_CH.UTF-8/fr_CH.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] zoo_1.8-8             XML_3.99-0.3          webchem_1.0.0         UpSetR_1.4.0          forcats_0.5.0        
 [6] tidyr_1.1.0           tibble_3.0.1          tidyverse_1.3.0       taxize_0.9.96         stringr_1.4.0        
[11] stringi_1.4.6         splitstackshape_1.4.8 rvest_0.3.5           xml2_1.3.2            reticulate_1.16      
[16] rentrez_1.2.2         readxl_1.3.1          readr_1.3.1           rcrossref_1.0.0       RColorBrewer_1.1-2   
[21] purrr_0.3.4           pbmcapply_1.5.0       jsonlite_1.6.1        igraph_1.2.5          ggraph_2.0.3         
[26] eulerr_6.1.0          dplyr_1.0.0           digest_0.6.25         data.table_1.12.8     collapsibleTree_0.1.7
[31] chorddiag_0.1.2       ChemmineR_3.40.0      plotly_4.9.2.1        Hmisc_4.4-0           ggplot2_3.3.1        
[36] Formula_1.2-3         survival_3.1-12       lattice_0.20-41      

loaded via a namespace (and not attached):
 [1] colorspace_1.4-1    rjson_0.2.20        ellipsis_0.3.1      htmlTable_1.13.3    fs_1.4.1            base64enc_0.1-3    
 [7] httpcode_0.3.0      rstudioapi_0.11     farver_2.0.3        urltools_1.7.3      graphlayouts_0.7.0  ggrepel_0.8.2      
[13] DT_0.13             lubridate_1.7.8     fansi_0.4.1         codetools_0.2-16    splines_4.0.0       bold_1.0.0         
[19] knitr_1.28          polyclip_1.10-0     broom_0.5.6         dbplyr_1.4.4        cluster_2.1.0       png_0.1-7          
[25] ggforce_0.3.1       shiny_1.4.0.2       data.tree_0.7.11    compiler_4.0.0      httr_1.4.1          backports_1.1.7    
[31] assertthat_0.2.1    Matrix_1.2-18       fastmap_1.0.1       lazyeval_0.2.2      cli_2.0.2           later_1.1.0.1      
[37] tweenr_1.0.1        acepack_1.4.1       htmltools_0.4.0     tools_4.0.0         gtable_0.3.0        glue_1.4.1         
[43] rsvg_2.1            tinytex_0.23        Rcpp_1.0.4.6        cellranger_1.1.0    vctrs_0.3.1         crul_0.9.0         
[49] ape_5.4             nlme_3.1-148        iterators_1.0.12    xfun_0.14           mime_0.9            miniUI_0.1.1.1     
[55] lifecycle_0.2.0     MASS_7.3-51.6       scales_1.1.1        tidygraph_1.2.0     hms_0.5.3           promises_1.1.0     
[61] curl_4.3            gridExtra_2.3       triebeard_0.3.0     rpart_4.1-15        reshape_0.8.8       latticeExtra_0.6-29
[67] foreach_1.5.0       checkmate_2.0.0     bibtex_0.4.2.2      rlang_0.4.6         pkgconfig_2.0.3     bitops_1.0-6       
[73] htmlwidgets_1.5.1   tidyselect_1.1.0    plyr_1.8.6          magrittr_1.5        R6_2.4.1            generics_0.0.2     
[79] DBI_1.1.0           haven_2.3.1         pillar_1.4.4        foreign_0.8-80      withr_2.2.0         RCurl_1.98-1.2     
[85] nnet_7.3-14         modelr_0.1.8        crayon_1.3.4        viridis_0.5.1       jpeg_0.1-8.1        grid_4.0.0         
[91] blob_1.2.1          reprex_0.3.0        xtable_1.8-4        httpuv_1.5.4        munsell_0.5.0       viridisLite_0.3.0 

Hi,

Thank you very much for your beautiful package.

I am using your package to retrieve DOIs from various sources. When working with titles, I use your cr_works() function which is great.

However, when working with PubMed IDs, I face the following issue:

Some valid PubMed IDs do not seem to be recognized.

As an example: 28371833

This is the output I get when using id_converter("28371833", "pmid"):

$status
[1] "ok"
$responseDate
[1] "2020-06-08 02:03:29"
$request
[1] "tool=rcrossref;email=myrmecocystus%40gmail.com;ids=28371833;idtype=pmid;format=json"
$records
      pmid  live status             errmsg
1 28371833 false  error invalid article id

However, the article ID is valid, as entrez_summary(db = "pubmed", id = "28371833")[["title"]] readily shows:

"Cytochrome P450 Monooxygenase CYP716A141 is a Unique β-Amyrin C-16β Oxidase Involved in Triterpenoid Saponin Biosynthesis in Platycodon grandiflorus."

It has nothing to do with the erratum; I checked other entries.

Some other IDs (31708947) work and I could not say why...

If any other information is needed, I am happy to give more details!

Thanks for the report, having a look.

The API request is here: https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?tool=rcrossref&email=myrmecocystus%40gmail.com&ids=28371833&idtype=pmid&format=json which gives the same response. So the problem is on the NCBI end of things. Not sure why they're saying it's an invalid article ID.
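For anyone who wants to reproduce the request outside of rcrossref, here is a minimal sketch that rebuilds the same URL from its query parameters. The helper name idconv_url() is hypothetical (it is not part of rcrossref), the default tool and email values are placeholders, and joining several IDs with commas is an assumption about the service.

```r
# Hypothetical helper: build an NLM ID Converter request URL.
# Not part of rcrossref; parameter names are taken from the URL in this thread.
idconv_url <- function(ids, idtype = NULL, tool = "my_tool",
                       email = "my_email@example.com", format = "json") {
  base <- "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"
  params <- list(
    tool = tool,
    email = email,
    ids = paste(ids, collapse = ","),  # assumption: IDs are comma-separated
    idtype = idtype,
    format = format
  )
  # Drop parameters that were not supplied (e.g. idtype = NULL)
  params <- Filter(Negate(is.null), params)
  # Percent-encode each value, including reserved characters such as "@" and "/"
  values <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(base, "?", paste0(names(params), "=", values, collapse = "&"))
}

url <- idconv_url("28371833", idtype = "pmid",
                  tool = "rcrossref", email = "myrmecocystus@gmail.com")

# Fetching requires network access and the jsonlite package:
# jsonlite::fromJSON(url)$records
```

Opening the resulting URL in a browser should show the same response as id_converter(), which helps separate package problems from problems on the NCBI side.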

(related issue #183)

Open Citations Corpus (https://github.com/ropenscilabs/citecorp) doesn't have that PMID either:

citecorp::oc_pmid2ids(28371833)
#> data frame with 0 columns and 0 rows

My bad... sorry for not re-opening there!

Strange from NCBI... but entrez seems to do the job correctly.

No worries about opening this issue.

It's hard to say why the problem is happening. The API service for the ID converter may be using an older database or something; there's no clarity on what's going on behind the scenes. For ID conversion, you may be better off using rentrez.
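A rough sketch of what the rentrez route could look like for PMID to DOI, using the same entrez_summary() call shown earlier in this thread. It assumes the summary record carries an articleids table with idtype and value columns; the two function names below are made up for illustration.

```r
# Hypothetical helper: pull the DOI out of an esummary-style `articleids` table.
# Assumes columns named `idtype` and `value`.
doi_from_articleids <- function(articleids) {
  hit <- articleids$value[articleids$idtype == "doi"]
  if (length(hit) == 0) NA_character_ else as.character(hit[[1]])
}

# Hypothetical wrapper: needs the rentrez package and network access when called.
pmid_to_doi <- function(pmid) {
  rec <- rentrez::entrez_summary(db = "pubmed", id = pmid)
  doi_from_articleids(rec$articleids)
}

# Usage (not run here): pmid_to_doi("28371833")
```

Keeping the extraction step separate from the network call makes it easy to check against a hand-built table before relying on it for many records.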

Hi @sckott and @Adafede,

If I understand correctly, id_converter() is built on NLM's ID Converter API, which is limited to records in PMC.

@JimHokanson explains in ropensci/rentrez#136 (comment)

As for a workaround, @dwinter's rentrez allows you to make the conversion using rentrez::entrez_fetch and rentrez::parse_pubmed_xml: ropensci/rentrez#136 (comment)

But that's a lot of extra data to download for just PMID-DOI conversion (when scaling to many records), so it would be great if there were a simpler converter, ideally one that also works from DOI to PMID (which is what I'm trying to do).
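For the DOI to PMID direction, one possible sketch is a PubMed search restricted to the article identifier field. The [AID] field tag and the function names are assumptions for illustration, not something confirmed in this thread, and a DOI that PubMed has not indexed will simply return no hit.

```r
# Hypothetical helper: turn a DOI (bare or as a doi.org URL) into a PubMed
# search term restricted to the article identifier field (assumed tag: [AID]).
doi_search_term <- function(doi) {
  bare <- sub("^https?://(dx\\.)?doi\\.org/", "", doi)
  paste0(bare, "[AID]")
}

# Hypothetical wrapper: needs the rentrez package and network access when called.
doi_to_pmid <- function(doi) {
  res <- rentrez::entrez_search(db = "pubmed", term = doi_search_term(doi))
  # Only trust an unambiguous single match
  if (length(res$ids) == 1) res$ids else NA_character_
}

# Usage (not run here): doi_to_pmid("10.1056/NEJMoa1916623")
```

This only transfers a list of matching IDs rather than full records, so it should be lighter than fetching and parsing the PubMed XML, though it still costs one request per DOI.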

Here are some related links I've come across:
https://www.crossref.org/labs/pmid2doi/
https://www.pmid2cite.com/ (promising, but I'm not finding any open source or an API for batch processing)
Via their website:
https://www.pmid2cite.com/pmid-to-doi-converter
https://www.pmid2cite.com/doi-to-pmid-converter

I'd appreciate any further suggestions.

Hi, I'm not sure this is the right place for your question, but anyway: the PubMed API does the job perfectly if you just aim to convert DOIs to PM(C)IDs and vice versa.

You can also download the PubMed conversion table locally if you really need it to be fast (have a look at https://www.ncbi.nlm.nih.gov/pmc/pmctopmid/).
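As a sketch of the local-table idea: once a conversion table is loaded as a data frame, the lookup itself is a vectorised match. The column names PMID and DOI are an assumption about the downloaded file, and the rows below are made-up placeholders, not real records. Since the table comes from PMC, it presumably covers PMC records only.

```r
# Hypothetical helper: look up DOIs for a vector of PMIDs in a local table.
# Assumes `table` has columns named PMID and DOI.
lookup_doi <- function(pmids, table) {
  idx <- match(as.character(pmids), as.character(table$PMID))
  as.character(table$DOI)[idx]  # NA where a PMID is not in the table
}

# Placeholder table standing in for the downloaded file, e.g.
# ids <- read.csv(path_to_table)   # path_to_table is hypothetical
ids <- data.frame(PMID = c("111", "222"),
                  DOI = c("10.1000/a", "10.1000/b"),
                  stringsAsFactors = FALSE)

found <- lookup_doi(c("222", "999"), ids)
```

Unmatched PMIDs come back as NA, so they can be retried against a web service afterwards.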

Thanks, @Adafede. Unfortunately, the NLM converter doesn't work for DOIs not available in PMC, similar to the PMID limitation.

For example: 10.1056/NEJMoa1916623
https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?tool=my_tool&email=my_email@example.com&ids=10.1056/NEJMoa1916623

id_converter() is built on NLM's ID Converter API

correct

We used to have a function for that Crossref pmid2doi service (see ?rcrossref-defunct), but we made it defunct. I think it was too unreliable or went down; not sure.

Hadn't seen pmid2cite. Agree that it doesn't look like there's any way to use it programmatically.

Your example of 10.1056/NEJMoa1916623 might be a case where it's so new that there isn't a PMID for it yet. Crossref and Unpaywall have the DOI, but they don't map it to other identifiers.

At least, I don't think there's anything left to do here.

Just in case someone stumbles on this awesome thread, do check out https://www.flickr.com/photos/dullhunk/454160748 which has some advice on this.