PMC OA: tags in the <journal-title> field break parse_pubmed_xml
aren-lorenson-enveda opened this issue · 2 comments
aren-lorenson-enveda commented
Describe the bug
Tags in the <journal-title> value cause:
File ".../pubmed_oa_parser.py", line 153, in parse_pubmed_xml
journal = " ".join([j.text for j in journal_node])
TypeError: sequence item 0: expected str instance, NoneType found
In the case of https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=7147450, changing
<italic>In Vivo</italic> Models of Inflammation
to
In Vivo Models of Inflammation
fixes it.
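For anyone who wants to see the cause without downloading the record, here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree behaves like the parser used by pubmed_parser with respect to .text and .tail (the actual parser backend is not shown in this issue), and the two XML strings are trimmed-down stand-ins for the real record. When an element starts with a child tag, its own .text is None and the remaining words live in the child's .tail:

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-ins for the <journal-title> element in the real record.
tagged = ET.fromstring(
    "<journal-title><italic>In Vivo</italic> Models of Inflammation</journal-title>"
)
plain = ET.fromstring(
    "<journal-title>In Vivo Models of Inflammation</journal-title>"
)

# The tagged title has no text before its first child, so .text is None.
print(repr(tagged.text))
# The words are split between the child's .text and the child's .tail.
print(repr(tagged[0].text), repr(tagged[0].tail))
# The plain title keeps everything in .text, so the join in the parser works.
print(repr(plain.text))
```

This matches the traceback: " ".join(...) receives None as its first item for the tagged title.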
To Reproduce
import pubmed_parser as pp
pp.parse_pubmed_xml('efetch_7147450.xml')
pp.parse_pubmed_xml('efetch_7147450_fixed.xml')
titipata commented
@aren-lorenson-enveda thanks for your issue. I guess a quick fix is to change to the following:
journal = " ".join([j.text for j in journal_node if j is not None])
However, I will check if it works properly later on (or if you try and it works, feel free to make a pull request)!
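A quick check of the suggestion above, as a hedged sketch rather than a tested patch: the items in journal_node are elements, which are never None, so filtering on `j is not None` most likely leaves the TypeError in place because it is j.text that is None. One possible alternative is to collect all nested text with itertext(). The helper name join_journal_titles and the inline XML are hypothetical, and xml.etree.ElementTree is assumed to stand in for the parser that pubmed_parser actually uses:

```python
import xml.etree.ElementTree as ET


def join_journal_titles(journal_nodes):
    """Hypothetical helper: join the full text of each <journal-title>,
    including text nested inside child tags such as <italic>."""
    return " ".join("".join(node.itertext()).strip() for node in journal_nodes)


# Inline stand-in for the relevant part of the record from the issue.
root = ET.fromstring(
    "<journal-meta><journal-title-group>"
    "<journal-title><italic>In Vivo</italic> Models of Inflammation</journal-title>"
    "</journal-title-group></journal-meta>"
)
journal_node = root.findall(".//journal-title")

# Filtering on the element does not help: the element exists, its .text is None.
try:
    " ".join([j.text for j in journal_node if j is not None])
    filter_on_element_raises = False
except TypeError:
    filter_on_element_raises = True

journal = join_journal_titles(journal_node)
print(filter_on_element_raises)
print(journal)
```

Filtering on `j.text is not None` instead would avoid the exception but silently drop the title for this record, which is why the sketch gathers the nested text rather than skipping it.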
Michael-E-Rose commented
@nils-herrmann could you please provide a PR? @titipata has posted an untested fix above.