titipata/pubmed_parser

PMC OA: tags in the <journal-title> field break parse_pubmed_xml

aren-lorenson-enveda opened this issue · 2 comments

Describe the bug
Inline markup tags (e.g. <italic>) inside the <journal-title> value cause parse_pubmed_xml to raise:

File ".../pubmed_oa_parser.py", line 153, in parse_pubmed_xml
    journal = " ".join([j.text for j in journal_node])
TypeError: sequence item 0: expected str instance, NoneType found

In the case of https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=7147450, changing

<italic>In Vivo</italic> Models of Inflammation

to

In Vivo Models of Inflammation

fixes it.
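The likely cause is that an element's `.text` is `None` when its content starts with a child tag. The sketch below shows this with the standard library's `xml.etree.ElementTree`, used here only to stay dependency-free, on the assumption that the parser used by pubmed_parser follows the same `.text` convention. The names `with_tag`, `plain`, and `join_titles` are illustrative and not part of pubmed_parser.

```python
import xml.etree.ElementTree as ET

# Title as it appears in the failing record: content starts with a child tag.
with_tag = ET.fromstring(
    "<journal-title><italic>In Vivo</italic> Models of Inflammation</journal-title>"
)
# Manually "fixed" title: plain text only.
plain = ET.fromstring(
    "<journal-title>In Vivo Models of Inflammation</journal-title>"
)


def join_titles(nodes):
    # Same expression as the failing line in the traceback.
    return " ".join([j.text for j in nodes])


print(with_tag.text)         # None: no text precedes <italic>
print(join_titles([plain]))  # In Vivo Models of Inflammation

try:
    join_titles([with_tag])
except TypeError as exc:
    print("TypeError:", exc)
```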

To Reproduce

import pubmed_parser as pp
pp.parse_pubmed_xml('efetch_7147450.xml')
pp.parse_pubmed_xml('efetch_7147450_fixed.xml')

xmls.zip

@aren-lorenson-enveda thanks for reporting this issue. I guess a quick fix is to change that line to the following:

journal = " ".join([j.text for j in journal_node if j is not None])

However, I will check if it works properly later on (or if you try and it works, feel free to make a pull request)!
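One caveat when testing this: the `<journal-title>` node itself exists and only its `.text` is `None`, so filtering on the node alone may not avoid the error, and filtering on `.text` would drop the title. An alternative, sketched below, is to collect all nested text with `itertext()`. This is an untested assumption about what would suit pubmed_parser, not the maintainers' fix. It uses the standard library's `xml.etree.ElementTree` for a self-contained demo, and `journal_title` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def journal_title(journal_nodes):
    """Join the full text of each <journal-title> node, including nested tags."""
    # itertext() yields the node's text, its children's text, and their tails
    # in document order, so inline markup such as <italic> is flattened.
    return " ".join("".join(j.itertext()).strip() for j in journal_nodes)


node = ET.fromstring(
    "<journal-title><italic>In Vivo</italic> Models of Inflammation</journal-title>"
)
print(journal_title([node]))  # In Vivo Models of Inflammation
```

The same function returns plain titles unchanged, so it should behave as before for records without inline tags.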

@nils-herrmann could you please provide a PR? @titipata has posted an untested fix here.