MrMimic/MEDOC

Weird stats in `medline_comments_corrections`

Closed this issue · 2 comments

Hi,

thanks for your work on this tool.

Looking at the data inserted in the medline_comments_corrections table, I see a heavily skewed ratio of entries. Specifically, only 556 of 159415775 rows have the field ref_pmid set to something other than N/A.

Example:

  • PMID: 732892
  • URL: link
    It should have 16 entries in the table in question with a relevant ID, but instead only one shows an ID.

All the remaining ones have a value of N/A, as depicted in the figure below.
[Figure: medoc-wrong-citation-network]

Hello !

I can't launch MEDOC right now, but I think the problem comes from this regex in MEDOC.py:

'ref_pmid': re.findall('<pmid version="1">(.[0-9])</pmid>', str(comment)),

I corrected it with:

'ref_pmid': re.findall('<pmid version="1">([0-9]{1,4})</pmid>', str(comment)),

Let me know if it solves the problem :)

Hi!
I just got back and I looked at your comment. Actually I used the following regex in my fork:

'ref_pmid': re.findall(r'<pmid version="1">(\d+)</pmid>', str(comment)),
'type': re.findall('<commentscorrections reftype="(.*?)">', str(comment)),

This catches PMIDs longer than 4 digits. It seems to work fine on a couple of archive files from PubMed.
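
For comparison, here is a minimal sketch that runs the three `ref_pmid` patterns from this thread against one sample string. The fragment and its PMIDs are made up for illustration, not taken from a real archive, and the names `PATTERNS` and `extract_pmids` are hypothetical helpers, not part of MEDOC.py.

```python
import re

# Hypothetical lower-cased fragment shaped like what the patterns expect.
# The PMIDs below are invented for illustration only.
comment = (
    '<commentscorrections reftype="Cites">'
    '<pmid version="1">12</pmid>'
    '<pmid version="1">4567</pmid>'
    '<pmid version="1">12345678</pmid>'
    '</commentscorrections>'
)

# The three ref_pmid patterns discussed in this thread.
PATTERNS = {
    "original": r'<pmid version="1">(.[0-9])</pmid>',        # exactly two characters
    "first_fix": r'<pmid version="1">([0-9]{1,4})</pmid>',   # one to four digits
    "fork": r'<pmid version="1">(\d+)</pmid>',               # one or more digits
}


def extract_pmids(pattern, text):
    """Return every PMID captured by `pattern` in `text`."""
    return re.findall(pattern, text)


results = {name: extract_pmids(p, comment) for name, p in PATTERNS.items()}
ref_types = re.findall(r'<commentscorrections reftype="(.*?)">', comment)

for name, pmids in results.items():
    print(name, pmids)
# original ['12']
# first_fix ['12', '4567']
# fork ['12', '4567', '12345678']
print(ref_types)
# ['Cites']
```

On this sample, the original pattern only captures the two-character ID and the `{1,4}` version stops at four digits, while `\d+` captures all three. The raw-string prefix (`r'...'`) keeps `\d` from being treated as a string escape.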

I can create a pull request again, if that helps.