Your code mistakes ages for a year value
gkuling opened this issue · 2 comments
Hello, I am using your package in my project to analyze hospital records and I have noticed an interesting bug I thought I'd share. It identifies ages in the text and converts them into a year value. For example:
test = 'Clinical history: 52-year-old man has...'
will be identified as
datefinder.find_dates(test, base_date=dt(2022, 7, 15))
> datetime.datetime(1952, 7, 15, 0, 0)
Also, the most recent install with pip doesn't have the 'first' parameter option in the DateFinder init function.
Great package btw, thank you
Signed - Grey
I suggest using the strict=True parameter to make it pickier. There are times when people are looking for almost any date related value and unfortunately that often produces a bunch of false positives. The strict param will only surface dates that look to have a year, month, and day of month.
PyPI has been updated as well!
I'll also note, an alternative approach would be to not use strict=True and instead use source=True. This will give you the original text it found that matched and let you run some heuristics on whether or not you want to accept or reject the date.
In [2]: text = 'Clinical history: 52-year-old man has...'
In [3]: print(list(datefinder.find_dates(text, source=True)))
[(datetime.datetime(2052, 8, 3, 0, 0), '52')]
You could do analysis on the source string of "52" and decide it's not sufficient.
In [4]: print(list(datefinder.find_dates(text, source=True, strict=True)))
[]
You can see the strict flag does not consider it a match, but there's no saying what you might consider a valid but incomplete date string that also won't be picked up by this.
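As a rough sketch of the kind of heuristic mentioned above, the snippet below drops any match whose source text is nothing but digits, such as "52". The helper name filter_weak_matches and the digits-only rule are assumptions made for illustration and are not part of datefinder. It only relies on the (datetime, source string) pairs that source=True yields, and it uses hand-built pairs so it runs without the package installed.

```python
import re
from datetime import datetime
from typing import Iterable, Iterator, Tuple

# Assumed heuristic (not part of datefinder): a source string made only of
# digits, such as "52", carries no month or day, so treat it as too weak.
_BARE_NUMBER = re.compile(r"^\s*\d+\s*$")


def filter_weak_matches(
    matches: Iterable[Tuple[datetime, str]],
) -> Iterator[Tuple[datetime, str]]:
    """Yield only the (date, source) pairs whose source is more than a bare number."""
    for found_date, source_text in matches:
        if _BARE_NUMBER.match(source_text):
            continue  # reject matches like "52" taken from "52-year-old"
        yield found_date, source_text


# Hand-built pairs in the same (datetime, source) shape that source=True yields.
sample = [
    (datetime(2052, 8, 3), "52"),
    (datetime(2022, 7, 15), "July 15, 2022"),
]
kept = list(filter_weak_matches(sample))
print(kept)
```

With real text, the same helper could be fed datefinder.find_dates(text, source=True) directly. The rule is deliberately narrow: it would also reject a bare four-digit year such as "2022", so adjust the pattern if those should count as valid dates for your records.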