Programmatically accessing PubMed articles is a common task for me. Luckily, NCBI's E-utilities (eutils) let us retrieve full article data in XML format. What I need is Python objects, not just XML strings, so pubmed-mapper was born.
pip install pubmed-mapper
from pubmed_mapper import Article
article = Article.parse_pmid('32329900')
# PubMed ID
print(article.pmid) # 32329900
# ids
print(article.ids) # [pubmed: 32329900, doi: 10.1111/jgs.16467]
print(article.ids[1].id_type) # doi
print(article.ids[1].id_value) # 10.1111/jgs.16467
# title
print(article.title) # Associations of Coffee...
# abstract
print(article.abstract) # <p><strong>Background: </strong>Coffee and tea...
# keywords
print(article.keywords) # ['aging', 'coffee; diet; longevity', 'tea']
# MeSH headings
print(article.mesh_headings) # ['Aged', 'Body Mass Index', '...']
# authors
print(article.authors) # [Shadyab AH Aladdin H, Manson JE JoAnn E, ...]
print(article.authors[0].last_name) # Shadyab
print(article.authors[0].forename) # Aladdin H
print(article.authors[0].initials) # AH
print(article.authors[0].affiliation) # Department of Family...
# journal
print(article.journal) # Journal of the American Geriatrics Society
print(article.journal.issn) # 1532-5415
print(article.journal.issn_type) # Electronic
print(article.journal.title) # Journal of the American Geriatrics Society
print(article.journal.abbr) # J Am Geriatr Soc
# volume
print(article.volume) # 68
# issue
print(article.issue) # 9
# references
print(article.references) # [n. 2013;129:643-659....]
print(article.references[0].citation) # Lotfield E, Freedman ND...
print(article.references[0].ids) # []
# pubdate
print(article.pubdate) # 2020-09-01
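When only a handful of fields are needed downstream, it can help to flatten an article into a plain dictionary. The sketch below is a minimal illustration and not part of pubmed-mapper: `summarize` is a hypothetical helper, it reads only attributes shown above, and it assumes `pubdate` is a `datetime.date`. A `SimpleNamespace` stands in for a parsed `Article` so the snippet runs without network access.

```python
from datetime import date
from types import SimpleNamespace


def summarize(article):
    """Flatten a few documented attributes into a plain dict.

    Hypothetical helper: it only reads attributes, so any object exposing
    the same names (for example a parsed Article) should work.
    """
    return {
        "pmid": article.pmid,
        "title": article.title,
        "journal": article.journal.abbr,
        "volume": article.volume,
        "issue": article.issue,
        # isoformat() keeps the record JSON-friendly
        "pubdate": article.pubdate.isoformat() if article.pubdate else None,
        "authors": [f"{a.last_name} {a.initials}" for a in article.authors],
    }


# Stand-in object mimicking the attribute layout shown above.
demo = SimpleNamespace(
    pmid="32329900",
    title="Associations of Coffee...",
    journal=SimpleNamespace(abbr="J Am Geriatr Soc"),
    volume="68",
    issue="9",
    pubdate=date(2020, 9, 1),
    authors=[SimpleNamespace(last_name="Shadyab", initials="AH")],
)
record = summarize(demo)
print(record["authors"])
```

With a real article, `summarize(Article.parse_pmid('32329900'))` would be the intended call, provided the attributes behave as in the listing above.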
from lxml import etree
from pubmed_mapper import Article
infile = 'xxx.xml'
# Parse the whole XML file into an element tree
with open(infile) as fp:
    root = etree.parse(fp)
articles = []
# Each PubmedArticle element maps to one Article object
for pubmed_article_element in root.xpath('/PubmedArticleSet/PubmedArticle'):
    article = Article.parse_element(pubmed_article_element)
    articles.append(article)
pubmed-mapper pmid -p 32329900
pubmed-mapper file -i data/pubmed21n0001.xml -o output/pubmed21n0001.jl
pubmed-mapper directory -i data/ -o output/pubmed-mapper.jl
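The `.jl` extension suggests the output is JSON Lines, one JSON object per line. Assuming that is the case (the format and field names are not stated here), such a file could be read back with the standard library alone. `read_jsonl` is a hypothetical helper, and the demo writes a tiny stand-in file with an illustrative `pmid` key rather than real pubmed-mapper output.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one decoded object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


# Demo with a stand-in file; the "pmid" key is illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jl")
    with open(demo_path, "w", encoding="utf-8") as fp:
        fp.write('{"pmid": "32329900"}\n\n{"pmid": "1"}\n')
    records = list(read_jsonl(demo_path))

print(len(records))
```

Because `read_jsonl` is a generator, a large output file can be processed one record at a time instead of being loaded whole.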
4.1 There are many types of PubMed article publication dates. How do you convert them to datetime.date objects?
Parsing publication dates is hard work, and pubmed-mapper cannot yet parse every format. The formats it can parse, and the values they are parsed to, are:
raw publication date | parsed value |
---|---|
2021-03-13 | 2021-03-13 |
2021-03 | 2021-03-01 |
2021 Spring | 2021-04-01 |
2021 | 2021-01-01 |
2021 Jan-Feb | 2021-01-01 |
2021 Mar 13-15 | 2021-03-13 |
2021 Mar-2022 Jan | 2021-03-01 |
2021-2022 | 2021-01-01 |
2021 Mar 13-Dec 15 | 2021-03-13 |
1976-1977 Winter | 1976-01-01 |
1977-1978 Fall-Winter | 1977-10-01 |
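The table boils down to one rule: take the start of any range, then default a missing month or day to 1, with seasons mapped to a starting month. The sketch below re-implements that rule independently to make it concrete; it is not pubmed-mapper's actual parser. `normalize_pubdate` is a hypothetical name, and the Summer mapping is an assumption, since the table only covers Spring, Fall and Winter.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}
# Spring, Fall and Winter follow the table above; Summer is an assumption.
SEASONS = {"Winter": 1, "Spring": 4, "Summer": 7, "Fall": 10}

ISO_RE = re.compile(r"^(\d{4})-(\d{2})(?:-(\d{2}))?$")
# Start year, an optional "-YYYY" range end, then whatever follows.
LOOSE_RE = re.compile(r"^(\d{4})(?:-\d{4})?\s*(.*)$")


def normalize_pubdate(text):
    """Map a raw publication date string to the first day it can denote."""
    text = text.strip()
    iso = ISO_RE.match(text)
    if iso:
        year, month, day = iso.groups()
        return date(int(year), int(month), int(day or 1))

    loose = LOOSE_RE.match(text)
    if not loose:
        raise ValueError(f"unsupported publication date: {text!r}")
    year, rest = int(loose.group(1)), loose.group(2).split()

    month, day = 1, 1
    if rest:
        # Keep only the start of a range such as "Jan-Feb" or "Mar-2022".
        name = rest[0].split("-")[0]
        month = MONTHS.get(name) or SEASONS.get(name)
        if month is None:
            raise ValueError(f"unsupported publication date: {text!r}")
        if len(rest) > 1:
            first = rest[1].split("-")[0]  # "13-15" and "13-Dec" give "13"
            if first.isdigit():
                day = int(first)
    return date(year, month, day)


print(normalize_pubdate("1977-1978 Fall-Winter"))  # 1977-10-01
```

Anything outside these patterns raises `ValueError` here; how pubmed-mapper itself treats unparseable dates is not covered by this sketch.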
pubmed-mapper.log is the default log file generated by pubmed-mapper. You can change it with the --log-file option:
pubmed-mapper --log-file my-custom.log file -i data/pubmed21n0001.xml -o output/pubmed21n0001.jl
Check this log file for more parsing details.
Use --log-level to log more detailed messages:
pubmed-mapper --log-file my-custom.log --log-level DEBUG file -i data/pubmed21n0001.xml -o output/pubmed21n0001.jl