titipata/pubmed_parser

Pubdate not returning correct year

MatthewDeitz opened this issue · 4 comments

There is a problem with some of the pubdate fields in the output. Instead of pulling the correct year, the parser splits the text on " " and grabs the first chunk, so you end up with pubdate values like ["Summer","Winter"]. Some example PMIDs where this happens are [28599031,28599032,28599033, etc]. Could you please update it to match on a regular expression like "\d{4}" instead of splitting on whitespace and grabbing the first chunk?
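A minimal sketch of the regex approach suggested here. The helper name `extract_year` and the sample strings are hypothetical, made up for illustration; this is not the code in pubmed_parser.

```python
import re

# Hypothetical helper for illustration only; not part of pubmed_parser.
def extract_year(pubdate_text):
    """Return the first run of four digits in the text, or "" if there is none."""
    match = re.search(r"\d{4}", pubdate_text)
    return match.group(0) if match else ""

# Splitting on whitespace picks up the season when it comes first.
season_first = "Summer 2017"
print(season_first.split(" ")[0])    # Summer
print(extract_year(season_first))    # 2017

# A year-first string gives the same year either way.
print(extract_year("2017 Jun 12"))   # 2017
```

Returning an empty string when no year is found is a design choice for this sketch; a caller could just as well return None or raise.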

Hello @MatthewDeitz, first, thanks for the issue! Is there any way that you can share the XML files for these PMIDs with me? And yes, I can update the parser to use a regular expression instead of splitting on whitespace. Alternatively, if you have already fixed the parser, feel free to send a pull request.

@MatthewDeitz, I updated the parser at de61d61. Let me know if this solves the issue.

Hi @titipata, I am parsing PubMed files from ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/ and have extracted the XML files to a folder. I am using pp.parse_medline_xml() with year_info_only=False, but I am only getting the year and month. The parser is not parsing the day, even when the day is present in the XML file.

Can you please tell me if this is the correct behaviour? If not, can you please point me to where the problem may be? I will try to fix it.

Thank you!

Hi @kaustubhn, thanks so much! So it should parse the date if it is available and return it. The function to do that is at https://github.com/titipata/pubmed_parser/blob/master/pubmed_parser/medline_parser.py#L241-L242. Feel free to fix it and make the PR!
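For anyone picking this up, here is a rough sketch of reading the day when it is present, assuming a MEDLINE `PubDate` element with optional `Year`, `Month`, and `Day` children. The function name `pubdate_to_string` and the sample XML are hypothetical and written for illustration; this is not the code at the linked lines.

```python
import xml.etree.ElementTree as ET

# Hypothetical helper for illustration only; not the parser's actual function.
def pubdate_to_string(pubdate):
    """Join whichever of Year, Month, Day are present, in that order, with '-'."""
    parts = []
    for tag in ("Year", "Month", "Day"):
        value = pubdate.findtext(tag)
        if not value:
            break  # stop at the first missing or empty component
        parts.append(value.strip())
    return "-".join(parts)

# Made-up sample elements, not taken from the baseline files.
full = ET.fromstring(
    "<PubDate><Year>2017</Year><Month>Jun</Month><Day>12</Day></PubDate>"
)
partial = ET.fromstring("<PubDate><Year>2017</Year><Month>Jun</Month></PubDate>")

print(pubdate_to_string(full))     # 2017-Jun-12
print(pubdate_to_string(partial))  # 2017-Jun
```

Records that carry a free-text `MedlineDate` instead of separate children would return an empty string here and need their own handling.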