ropensci/europepmc

epmc_search returns fewer fields than available in the API

Thank you for this package, maintainers!

I notice that epmc_search doesn't return some of the useful fields that are available in the API. I think it would be valuable to return all fields. For example, the API returns both the boolean hasTMAccessionNumbers and the accessionType, while the package returns only the former.

Example of different fields returned:

library(europepmc)
library(httr)

# get results for one id from the package and the api
package_result <- epmc_search("PMC10669250")
direct_api_result <-
  GET("https://www.ebi.ac.uk/europepmc/webservices/rest/search",
      query = list(query = "PMC10669250",
                   resultType = "lite",
                   format = "json")) |>
  content()

# compare fields returned
package_result |> names()
direct_api_result$resultList$result[[1]] |> unlist() |> names()

From the package:

 [1] "id"                    "source"                "pmcid"                 "title"                 "authorString"          "journalTitle"          "issue"                
 [8] "journalVolume"         "pubYear"               "journalIssn"           "pubType"               "isOpenAccess"          "inEPMC"                "inPMC"                
[15] "hasPDF"                "hasBook"               "hasSuppl"              "citedByCount"          "hasReferences"         "hasTextMinedTerms"     "hasDbCrossReferences" 
[22] "hasLabsLinks"          "hasTMAccessionNumbers" "firstIndexDate"        "firstPublicationDate" 

From the API:

 [1] "id"                                "source"                            "pmcid"                             "fullTextIdList.fullTextId"        
 [5] "title"                             "authorString"                      "journalTitle"                      "issue"                            
 [9] "journalVolume"                     "pubYear"                           "journalIssn"                       "pubType"                          
[13] "isOpenAccess"                      "inEPMC"                            "inPMC"                             "hasPDF"                           
[17] "hasBook"                           "hasSuppl"                          "citedByCount"                      "hasReferences"                    
[21] "hasTextMinedTerms"                 "hasDbCrossReferences"              "hasLabsLinks"                      "hasTMAccessionNumbers"            
[25] "tmAccessionTypeList.accessionType" "firstIndexDate"                    "firstPublicationDate"    
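
To make the gap explicit, the two name listings above can be compared with setdiff(). This is a small sketch that copies the field names from the output shown here rather than calling the API again; the variable names package_fields, api_fields and missing_fields are only illustrative.

```r
# Field names copied from the two listings above (no network call needed).
package_fields <- c(
  "id", "source", "pmcid", "title", "authorString", "journalTitle", "issue",
  "journalVolume", "pubYear", "journalIssn", "pubType", "isOpenAccess",
  "inEPMC", "inPMC", "hasPDF", "hasBook", "hasSuppl", "citedByCount",
  "hasReferences", "hasTextMinedTerms", "hasDbCrossReferences",
  "hasLabsLinks", "hasTMAccessionNumbers", "firstIndexDate",
  "firstPublicationDate"
)

api_fields <- c(
  "id", "source", "pmcid", "fullTextIdList.fullTextId", "title",
  "authorString", "journalTitle", "issue", "journalVolume", "pubYear",
  "journalIssn", "pubType", "isOpenAccess", "inEPMC", "inPMC", "hasPDF",
  "hasBook", "hasSuppl", "citedByCount", "hasReferences",
  "hasTextMinedTerms", "hasDbCrossReferences", "hasLabsLinks",
  "hasTMAccessionNumbers", "tmAccessionTypeList.accessionType",
  "firstIndexDate", "firstPublicationDate"
)

# Fields present in the API response but absent from the package result.
missing_fields <- setdiff(api_fields, package_fields)
missing_fields
#> [1] "fullTextIdList.fullTextId"         "tmAccessionTypeList.accessionType"
```

So for this record the package result lacks fullTextIdList.fullTextId and tmAccessionTypeList.accessionType.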

Hi @arvi1000,
You're right, by default epmc_search only returns a subset of the fields Europe PMC provides. To access all of them, set output = "raw", which returns the records as a list. Here's an example parser for your query:

library(europepmc)
library(tidyverse)
my_epmc_data <- epmc_search("PMC10669250", output = "raw")
#> 1 records found, returning 1

tibble::tibble(
  id = map_chr(my_epmc_data, "id"),
  tm_accession_type = map(my_epmc_data, "tmAccessionTypeList") |>
    map_chr("accessionType")
)
#> # A tibble: 1 × 2
#>   id          tm_accession_type
#>   <chr>       <chr>            
#> 1 PMC10669250 chebi

Created on 2024-06-12 with reprex v2.1.0
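
If a query returns records with several accession types, or with none at all, a parser that expects exactly one value per record may need adjusting. Below is a minimal base R sketch of one way to handle that. It runs on mock records that are only assumed to mirror the shape of the raw output (a list of records, each with an id and an optional tmAccessionTypeList$accessionType); the ids EXAMPLE1 and EXAMPLE2, their accession types, and the helper get_accession_types() are hypothetical.

```r
# Mock records, assumed to mimic the shape of epmc_search(..., output = "raw").
# Only the first record reflects the thread; the other two are made up.
mock_raw <- list(
  list(id = "PMC10669250",
       tmAccessionTypeList = list(accessionType = list("chebi"))),
  list(id = "EXAMPLE1",
       tmAccessionTypeList = list(accessionType = list("pdb", "uniprot"))),
  list(id = "EXAMPLE2")  # no tmAccessionTypeList at all
)

# Hypothetical helper: collapse all accession types of one record into a
# single ";"-separated string, or return NA when the field is absent.
get_accession_types <- function(record) {
  types <- unlist(record[["tmAccessionTypeList"]][["accessionType"]])
  if (is.null(types)) NA_character_ else paste(types, collapse = ";")
}

parsed <- data.frame(
  id = vapply(mock_raw, function(r) r[["id"]], character(1)),
  tm_accession_type = vapply(mock_raw, get_accession_types, character(1)),
  stringsAsFactors = FALSE
)
parsed
```

Collapsing to one string keeps one row per record; a list-column would be an alternative if the individual types need to stay separate.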