
An R package for accessing a variety of PubMed data, including abstracts, bibliometrics, pubtations, and full-text records, through a single, user-friendly interface.


pubmedtk

The package provides a single interface for accessing a range of NLM/PubMed databases, including PubMed abstract records, iCite bibliometric data, PubTator named entity annotations, and full-text entries from PubMed Central (PMC). This unified interface simplifies the data retrieval process, allowing users to interact with multiple PubMed services/APIs/output formats through a single R function.

The package also includes MeSH ontology resources as simple data frames, including Descriptor Terms, Descriptor Tree Structures, Supplementary Concept Terms, and Pharmacological Actions, as well as descriptor-level word embeddings (Noh & Kavuluru 2021). These resources are provided via the mesh-resources library.

Installation

You can install the development version from GitHub with:

devtools::install_github("jaytimm/pubmedtk")

Usage

PubMed search

The package has two basic functions: search_pubmed and get_records. The former fetches PMIDs from the PubMed API for a user-supplied search query; the latter retrieves records for those PMIDs from a user-specified endpoint: pubmed_abstracts, pubmed_affiliations, pubtations, icites, or pmc_fulltext.

Search syntax is the same as that implemented in standard PubMed search.

pmids <- pubmedtk::search_pubmed('("political ideology"[TiAb])',
                                 use_pub_years = FALSE)

# pmids <- pubmedtk::search_pubmed('immunity',
#                                  use_pub_years = TRUE,
#                                  start_year = 2022,
#                                  end_year = 2024)
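
Because the query is an ordinary PubMed search string, it can be assembled programmatically before it is passed to search_pubmed. The following is a minimal sketch in base R; build_query is a hypothetical helper written for illustration and is not part of pubmedtk. It assumes the standard PubMed field tags [TiAb] (title/abstract) and [dp] (date of publication).

```r
# Hypothetical helper (not part of pubmedtk): tag each phrase with a PubMed
# field tag and join the tagged phrases with a boolean operator.
build_query <- function(phrases, field = "TiAb", operator = "AND") {
  tagged <- sprintf('"%s"[%s]', phrases, field)
  paste0("(", paste(tagged, collapse = paste0(" ", operator, " ")), ")")
}

query <- build_query(c("political ideology", "public health"))
print(query)

# Optionally restrict by publication date with PubMed's [dp] range syntax.
dated_query <- paste(query, "AND 2022:2024[dp]")
print(dated_query)
```

The resulting string can then be supplied as the first argument to search_pubmed, in the same way as the literal queries shown above.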

Get record-level data

pubmed <- pmids |> 
  pubmedtk::get_records(endpoint = 'pubmed_abstracts', 
                        cores = 3, 
                        sleep = 1) 

affiliations <- pmids |> 
  pubmedtk::get_records(endpoint = 'pubmed_affiliations', 
                        cores = 3, 
                        sleep = 0.5)

icites <- pmids |> 
  pubmedtk::get_records(endpoint = 'icites',
                        cores = 4, 
                        sleep = 0.25)

pubtations <- pmids |> 
  pubmedtk::get_records(endpoint = 'pubtations')
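
Since every endpoint is queried with the same PMIDs, a natural next step is to combine the results into one table. The sketch below uses toy data frames rather than real get_records() output, and the column names (pmid, title, citation_count) are assumptions made for illustration; check the actual column names returned by each endpoint before adapting it.

```r
# Toy stand-ins for two get_records() results. The column names here
# (pmid, title, citation_count) are assumptions for illustration only.
abstracts_toy <- data.frame(
  pmid = c("111", "222", "333"),
  title = c("Paper A", "Paper B", "Paper C"),
  stringsAsFactors = FALSE
)
icites_toy <- data.frame(
  pmid = c("111", "333"),
  citation_count = c(12L, 4L),
  stringsAsFactors = FALSE
)

# Left join: keep every abstract and attach bibliometrics where available.
# Records without a match get NA in the joined columns.
combined <- merge(abstracts_toy, icites_toy, by = "pmid", all.x = TRUE)
print(combined)
```

A left join is used here so that records missing from one endpoint are kept rather than silently dropped.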

When the endpoint is pmc_fulltext, the get_records() function takes a vector of file paths (from the PMC Open Access list) instead of PMIDs.

pmclist <- pubmedtk::data_pmc_list(force_install = FALSE)
pmc_pmids <- pmclist[PMID %in% pmids]

pmc_fulltext <- pmc_pmids$fpath[1:20] |> 
  pubmedtk::get_records(endpoint = 'pmc_fulltext', cores = 2)
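
The subsetting step above uses data.table-style syntax, which presumably relies on the PMC list being a data.table. As a sketch of the same filtering logic in base R, the example below uses a toy table with made-up values; only the PMID and fpath column names are taken from the code above. It also uses head() in place of [1:20], since indexing past the end of a vector pads the result with NA when fewer than 20 records match.

```r
# Toy stand-in for the PMC Open Access list. PMID and fpath mirror the
# column names used above; the values are made up.
pmclist_toy <- data.frame(
  PMID = c("111", "222", "333", "444"),
  fpath = c("oa/a.tar.gz", "oa/b.tar.gz", "oa/c.tar.gz", "oa/d.tar.gz"),
  stringsAsFactors = FALSE
)
pmids_toy <- c("222", "444", "999")

# Keep only the rows whose PMID appears in the search results.
matched <- pmclist_toy[pmclist_toy$PMID %in% pmids_toy, ]

# Take at most the first 20 file paths; head() returns fewer elements
# (rather than NA padding) when fewer than 20 rows match.
fpaths <- head(matched$fpath, 20)
print(fpaths)
```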