Weird stats in `medline_comments_corrections`
Closed this issue · 2 comments
Hi,
thanks for your work on this tool.
Looking at the data inserted into the `medline_comments_corrections`
table, I see quite a skewed ratio of entries. Specifically, only 556 of 159415775 rows have a ref_pmid field
that is not set to N/A.
Example:
- PMID: 732892
- URL: link
It should have 16 entries in the table in question, each with a relevant ID, but only one shows an ID.
All the remaining ones have a value of N/A, as depicted in the figure below.
Hello !
I can't launch MEDOC right now, but I think the problem comes from this regex in MEDOC.py:
'ref_pmid': re.findall('<pmid version="1">(.[0-9])</pmid>', str(comment)),
I corrected it with:
'ref_pmid': re.findall('<pmid version="1">([0-9]{1,4})</pmid>', str(comment)),
Let me know if it solves the problem :)
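To see why so few rows get an ID, here is a minimal sketch that runs the two patterns quoted above against IDs of different lengths. The sample IDs and the `extract` helper are hypothetical, made up for illustration only. The original pattern `(.[0-9])` captures exactly two characters, so only two-character IDs match, and `{1,4}` still stops at four digits.

```python
import re

# The two patterns quoted in this thread, written as raw strings.
ORIGINAL = r'<pmid version="1">(.[0-9])</pmid>'
PROPOSED = r'<pmid version="1">([0-9]{1,4})</pmid>'


def extract(pattern, pmid):
    """Hypothetical helper: wrap an ID in a <pmid> element and apply the pattern."""
    return re.findall(pattern, f'<pmid version="1">{pmid}</pmid>')


# Hypothetical IDs of increasing length (not taken from the database).
samples = ["7", "42", "1234", "12345678"]

original_hits = {p: extract(ORIGINAL, p) for p in samples}
proposed_hits = {p: extract(PROPOSED, p) for p in samples}

for p in samples:
    print(p, original_hits[p], proposed_hits[p])
```

An empty list here corresponds to a reference whose ID was not captured.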
Hi!
I just got back and looked at your comment. Actually, I used the following regexes in my fork:
'ref_pmid': re.findall('<pmid version="1">(\d+)</pmid>', str(comment)),
'type': re.findall('<commentscorrections reftype="(.*?)">', str(comment)),
This catches PMIDs longer than 4 digits. It seems to work fine on a couple of archive files from PubMed.
I can open a pull request again if that helps.
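As a quick check of the two regexes from the fork, here is a small sketch applied to a made-up snippet. The `comment` string, its `refsource` text, and the 8-digit ID are assumptions for illustration, shaped like the lowercased markup the patterns target, not real data from a PubMed archive. The patterns are written as raw strings so that `\d` is not treated as a string escape.

```python
import re

# Hypothetical snippet, shaped like the markup the patterns in this thread expect.
comment = (
    '<commentscorrections reftype="Cites">'
    '<refsource>Example source</refsource>'
    '<pmid version="1">12345678</pmid>'
    '</commentscorrections>'
)

row = {
    # \d+ accepts an ID of any length, including more than 4 digits.
    'ref_pmid': re.findall(r'<pmid version="1">(\d+)</pmid>', str(comment)),
    # Non-greedy capture of the reftype attribute value.
    'type': re.findall(r'<commentscorrections reftype="(.*?)">', str(comment)),
}

print(row)  # {'ref_pmid': ['12345678'], 'type': ['Cites']}
```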