Incorrect dates getting extracted for OCR errored cases
Closed this issue · 4 comments
The parser is outputting incorrect dates for cases with a slight OCR error (like the digit 1 getting replaced by the letter l).
import datefinder

date_string = "January 11, 2028" # Digit 1
list(datefinder.find_dates(date_string))
# [datetime.datetime(2028, 1, 11, 0, 0)]
date_string = "January l1, 2028" # Digit 1 replaced by the letter l
list(datefinder.find_dates(date_string))
# [datetime.datetime(2028, 1, 4, 0, 0)]
Ideally, it shouldn't be extracting any date in the second case - but it is silently giving the wrong date.
Is there any way to handle such cases explicitly?
Practically speaking, this is a feature rather than a bug!
@khanfarhan10
How I would like it to work: throw an error (or output no dates at all) if it cannot parse the dates correctly, so the user can handle that accordingly. Silently giving incorrect dates makes it very difficult to use it as-is on a variety of data without risking the performance.
Also, I am not able to work out where it is getting the day 4 from.
> Silently giving incorrect dates makes it very difficult to use it as-is on a variety of data without risking the performance.
This library is designed to extract as many dates as possible from freeform text. This includes, by default, incomplete dates (e.g. Jan 1987) and dates in odd formats (e.g. 2020 Feb., 7th). When something doesn't look like a date, we simply move on. There is no sense in throwing errors on random numbers or words (e.g. 5555-33-01). That would make it unusable for many applications. If you have your date string isolated from the rest of your text, you can use tools like dateparser to throw an error if it doesn't parse correctly.
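For the isolated-string case, here is a minimal sketch of "fail loudly instead of guessing". It uses the standard library's datetime.strptime as a stand-in rather than dateparser, and it assumes you know the expected format up front and are running under an English locale (for the %B month name). The helper name parse_isolated_date is made up for illustration.

```python
from datetime import datetime


def parse_isolated_date(text, fmt="%B %d, %Y"):
    """Parse a single, already isolated date string.

    Raises ValueError when the text does not match fmt exactly,
    so an OCR slip such as "l1" is rejected instead of guessed at.
    """
    return datetime.strptime(text.strip(), fmt)


for candidate in ("January 11, 2028", "January l1, 2028"):
    try:
        print(candidate, "->", parse_isolated_date(candidate))
    except ValueError as exc:
        # The caller decides what to do: log, skip, send to manual review...
        print(candidate, "-> rejected:", exc)
```

The trade-off is the one described above: this only works once the date string has been cut out of the surrounding text, and it accepts exactly one format.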
The reason you are getting a day of month of 4 has to do with matching an incomplete date. Inside "January l1, 2028" the library found January and 2028 and did its best with it. Since a datetime has to have a day of the month and not just month/year, the underlying dateparser library fills in gaps with a base date. You can override the base date here: https://github.com/akoumjian/datefinder/blob/master/datefinder/__init__.py#L320.
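To make the gap-filling concrete, here is a rough standard-library illustration of the idea. It is not the library's actual code: fill_from_base is a hypothetical helper, and the base date below is an assumed value chosen only because its day of month is 4.

```python
from datetime import datetime


def fill_from_base(base, year=None, month=None, day=None):
    """Return base with every field that was actually parsed overriding it.

    Fields left as None were missing from the text and keep the base value.
    """
    parsed = {"year": year, "month": month, "day": day}
    found = {name: value for name, value in parsed.items() if value is not None}
    return base.replace(**found)


# Assumed base date for illustration; only its day of month matters here.
base = datetime(2021, 6, 4)

# "January l1, 2028": only the month and year were recognised.
print(fill_from_base(base, year=2028, month=1))
```

With only month and year recognised, the day comes from the base date, which is how a 4 can show up even though no 4 appears in the text.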
You have a couple of different options, as I see it. If you use strict=True with datefinder, you won't get matches for dates that don't consist of at least year, month, and day of month. In your original example, it would simply skip over it. Since you are trying to extract dates from bad quality data (literally the date you want is not in your text due to the OCR error), you could alternatively use source=True when you use datefinder and pass every result through a custom heuristic function or manual review process. This way you can look at the originally matched text "January l1, 2028" and decide what to do with it.
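As a sketch of the source=True route: the heuristic below flags matched text containing a token that mixes letters and digits (like l1) unless it looks like an ordinal (like 7th). The rule and the helper names has_suspicious_token and find_trusted_dates are assumptions made up for illustration, not part of datefinder; tune the rule to the OCR errors you actually see.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9]+")
ORDINAL = re.compile(r"^\d{1,2}(st|nd|rd|th)$", re.IGNORECASE)


def has_suspicious_token(source_text):
    """True if any token mixes letters and digits and is not an ordinal."""
    for token in TOKEN.findall(source_text):
        has_alpha = any(ch.isalpha() for ch in token)
        has_digit = any(ch.isdigit() for ch in token)
        if has_alpha and has_digit and not ORDINAL.match(token):
            return True
    return False


def find_trusted_dates(text):
    """Split datefinder results into trusted dates and matches needing review."""
    import datefinder  # third-party; imported lazily so the heuristic runs without it

    trusted, needs_review = [], []
    # With source=True, find_dates yields (datetime, matched_text) pairs.
    for found, source_text in datefinder.find_dates(text, source=True):
        if has_suspicious_token(source_text):
            needs_review.append((found, source_text))
        else:
            trusted.append(found)
    return trusted, needs_review


print(has_suspicious_token("January l1, 2028"))  # True
print(has_suspicious_token("January 11, 2028"))  # False
```

If you would rather drop incomplete matches outright, passing strict=True to datefinder.find_dates is the simpler switch; the heuristic is for when you want to keep the match around for review.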
I would consider investigating libraries which use statistical models to help reduce your OCR errors like the one above.