scrapinghub/dateparser

Bad escape characters trigger an exception

Etirf opened this issue · 17 comments

Etirf commented

Note: As a workaround for this issue, we have pinned regex, which makes Python 3.11 support either impossible or uncomfortable. The goal now is to remove that version pin on regex without making this issue resurface.

Hello everyone,

I tried parsing under Python 3.7.5 and 3.9:

dateparser.parse('12/12/12')

It also gives the same output for any "valid" input shown in the doc:

dateparser.parse('Fri, 12 Dec 2014 10:55:50')
dateparser.parse('22 Décembre 2010', date_formats=['%d %B %Y'])
...

Here's the error:


---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 dateparser.parse("12/12/12")

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\conf.py:92, in apply_settings.<locals>.wrapper(*args, **kwargs)
     89 if not isinstance(kwargs['settings'], Settings):
     90     raise TypeError("settings can only be either dict or instance of Settings class")
---> 92 return f(*args, **kwargs)

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\__init__.py:61, in parse(date_string, date_formats, languages, locales, region, settings, detect_languages_function)
     57 if languages or locales or region or detect_languages_function or not settings._default:
     58     parser = DateDataParser(languages=languages, locales=locales,
     59                             region=region, settings=settings, detect_languages_function=detect_languages_function)
---> 61 data = parser.get_date_data(date_string, date_formats)
     63 if data:
     64     return data['date_obj']

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:428, in DateDataParser.get_date_data(self, date_string, date_formats)
    425 date_string = sanitize_date(date_string)
    427 for locale in self._get_applicable_locales(date_string):
--> 428     parsed_date = _DateLocaleParser.parse(
    429         locale, date_string, date_formats, settings=self._settings)
    430     if parsed_date:
    431         parsed_date['locale'] = locale.shortname

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:178, in _DateLocaleParser.parse(cls, locale, date_string, date_formats, settings)
    175 @classmethod
    176 def parse(cls, locale, date_string, date_formats=None, settings=None):
    177     instance = cls(locale, date_string, date_formats, settings)
--> 178     return instance._parse()

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:182, in _DateLocaleParser._parse(self)
    180 def _parse(self):
    181     for parser_name in self._settings.PARSERS:
--> 182         date_data = self._parsers[parser_name]()
    183         if self._is_valid_date_data(date_data):
    184             return date_data

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:196, in _DateLocaleParser._try_freshness_parser(self)
    194 def _try_freshness_parser(self):
    195     try:
--> 196         return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
    197     except (OverflowError, ValueError):
    198         return None

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:234, in _DateLocaleParser._get_translated_date(self)
    232 def _get_translated_date(self):
    233     if self._translated_date is None:
--> 234         self._translated_date = self.locale.translate(
    235             self.date_string, keep_formatting=False, settings=self._settings)
    236     return self._translated_date

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:131, in Locale.translate(self, date_string, keep_formatting, settings)
    128 dictionary = self._get_dictionary(settings)
    129 date_string_tokens = dictionary.split(date_string, keep_formatting)
--> 131 relative_translations = self._get_relative_translations(settings=settings)
    133 for i, word in enumerate(date_string_tokens):
    134     word = word.lower()

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:158, in Locale._get_relative_translations(self, settings)
    155 if settings.NORMALIZE:
    156     if self._normalized_relative_translations is None:
    157         self._normalized_relative_translations = (
--> 158             self._generate_relative_translations(normalize=True))
    159     return self._normalized_relative_translations
    160 else:

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:172, in Locale._generate_relative_translations(self, normalize)
    170     value = list(map(normalize_unicode, value))
    171 pattern = '|'.join(sorted(value, key=len, reverse=True))
--> 172 pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
    173 pattern = re.compile(r'^(?:{})$'.format(pattern), re.UNICODE | re.IGNORECASE)
    174 relative_dictionary[pattern] = key

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\regex\regex.py:700, in _compile_replacement_helper(pattern, template)
    695     break
    696 if ch == "\\":
    697     # '_compile_replacement' will return either an int group reference
    698     # or a string literal. It returns items (plural) in order to handle
    699     # a 2-character literal (an invalid escape sequence).
--> 700     is_group, items = _compile_replacement(source, pattern, is_unicode)
    701     if is_group:
    702         # It's a group, so first flush the literal.
    703         if literal:

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\regex\_regex_core.py:1736, in _compile_replacement(source, pattern, is_unicode)
   1733         if value is not None:
   1734             return False, [value]
-> 1736     raise error("bad escape \\%s" % ch, source.string, source.pos)
   1738 if isinstance(source.sep, bytes):
   1739     octal_mask = 0xFF

error: bad escape \d at position 7

How to reproduce:
Env: windows 10

  • Fresh install of python 3.7.5 or 3.9
  • Make a simple python file including these 2 lines:
import dateparser
dateparser.parse("12/12/12")

I am seeing the exact same behavior with code that worked just 2 hours ago. This is on macOS. I tested with python 3.8.2, 3.8.5, and 3.10.2

Same here. Python 3.7.12, macOS.

Same here, Python 3.9-slim and 3.10-slim docker images, sample code:

from dateparser import parse
parse("7 days ago")

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/dateparser/conf.py", line 92, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/dateparser/__init__.py", line 61, in parse
    data = parser.get_date_data(date_string, date_formats)
  File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 428, in get_date_data
    parsed_date = _DateLocaleParser.parse(
  File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 178, in parse
    return instance._parse()
  File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 182, in _parse
    date_data = self._parsers[parser_name]()
  File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 196, in _try_freshness_parser
    return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
  File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 234, in _get_translated_date
    self._translated_date = self.locale.translate(
  File "/usr/local/lib/python3.10/site-packages/dateparser/languages/locale.py", line 131, in translate
    relative_translations = self._get_relative_translations(settings=settings)
  File "/usr/local/lib/python3.10/site-packages/dateparser/languages/locale.py", line 158, in _get_relative_translations
    self._generate_relative_translations(normalize=True))
  File "/usr/local/lib/python3.10/site-packages/dateparser/languages/locale.py", line 172, in _generate_relative_translations
    pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
  File "/usr/local/lib/python3.10/site-packages/regex/regex.py", line 700, in _compile_replacement_helper
    is_group, items = _compile_replacement(source, pattern, is_unicode)
  File "/usr/local/lib/python3.10/site-packages/regex/_regex_core.py", line 1736, in _compile_replacement
    raise error("bad escape \\%s" % ch, source.string, source.pos)
regex._regex_core.error: bad escape \d at position 7

We were using dateparser==1.0.0, upgrading to dateparser==1.1.0 didn't solve the issue.

This was probably caused by the dependency regex==2022.3.15.
Rolling back to regex==2022.1.18 may help.

update: this commit
mrabarnett/mrab-regex@138970b

I can confirm that deploying regex==2022.1.18 instead (through conda in my case) makes the bug disappear.

Caused by a behaviour change introduced in mrabarnett/mrab-regex@138970b (released as regex v2022.3.15). Installing any version before this (e.g. v2022.3.2) should fix it.

The change was to raise on invalid ASCII escape characters in pattern compilation and substitution. Not sure whether it's a bug in dateparser or in regex.

This will be a problem on all supported platforms and environments (Linux, MacOS, Windows; Python 3.6 to 3.10)
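A minimal sketch of that version boundary, assuming an unpatched dateparser and plain dotted regex version strings. `is_affected_regex` and `FIRST_STRICT_RELEASE` are hypothetical names for illustration, not part of either library:

```python
# First regex release that, per this thread, raises on bad escapes in
# replacement templates.
FIRST_STRICT_RELEASE = (2022, 3, 15)


def is_affected_regex(version: str) -> bool:
    """Return True if `version` is at or after the first strict release."""
    # Assumes purely numeric, dot-separated components such as "2022.1.18".
    parts = tuple(int(part) for part in version.split(".")[:3])
    return parts >= FIRST_STRICT_RELEASE


for candidate in ("2022.1.18", "2022.3.2", "2022.3.15"):
    print(candidate, is_affected_regex(candidate))
```

Components are compared as integers rather than strings, so 2022.3.2 correctly sorts before 2022.3.15.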

This makes CI/CD break when installing the latest version. Please update the PyPI package too, thanks a lot.

Hi. I was also faced with the same problem (and thought it was a Mac M1 problem with the regex lib).
It turns out to be related to regex dropping support for Python versions before 3.6:

Since Python 3.6, the re module has been rejecting unknown escape sequences such as \q in patterns and escape sequences including \d in replacement templates.

As the regex module no longer supports versions of Python <3.6, I've brought the regex module into line with re.

Your code should now read:

pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\\d+', pattern)

More info in mrabarnett/mrab-regex/issues/459

Here is a problematic pattern but there may be more?
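The escaping rule quoted above can be reproduced with the standard-library `re` module alone (Python 3.7 or later), which is the behaviour regex was aligned with. This is only a sketch: `DIGIT_GROUP` is a hypothetical stand-in for dateparser's `DIGIT_GROUP_PATTERN`, and the source string is an assumed example of a relative-date pattern.

```python
import re

# Hypothetical stand-in for DIGIT_GROUP_PATTERN: matches the literal
# characters backslash, 'd', '+' inside a pattern string.
DIGIT_GROUP = re.compile(r'\\d\+')

# Assumed example of a relative-date pattern before substitution.
source = r'(\d+) days ago'


def substitute(template):
    """Apply the template, returning the error text if it is rejected."""
    try:
        return DIGIT_GROUP.sub(template, source)
    except re.error as exc:
        return 'error: {}'.format(exc)


# '\d' is not a valid escape in a replacement template, so this is rejected.
unescaped = substitute(r'?P<n>\d+')
# Doubling the backslash makes the template emit a literal backslash.
escaped = substitute(r'?P<n>\\d+')
# A callable replacement is used as-is, with no template parsing at all.
via_callable = DIGIT_GROUP.sub(lambda match: r'?P<n>\d+', source)

print(unescaped)
print(escaped)
print(via_callable)
```

Both accepted variants produce the same pattern text; only the unescaped template is rejected.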

I can confirm that this issue is NOT specific to MacOS - our CI/CD uses Linux machines and was affected by this. My local machine, running Ubuntu, was also affected.

Explicitly pinning regex==2022.1.18 as suggested by @xiaopc fixed it for us.

Thanks for the fix and for writing the library in the first place. This seems to me to be one of the best date parsing libraries; we use it for a lot of data imports. Hoping for a pip release soon as well. Keep up the good work 👍

Many thanks for the thorough investigation!
For now I'll make a quick fix by pinning the regex version, but in the long run we should follow @tducret's suggestion (#1045 (comment)) and reform the regexes.

If anyone's up for a PR with the fix, please go ahead!

Is it possible to push version 1.1.1 to PyPI, please?

Thank you for raising that, @rerowep. It seems like the PyPI publish action got stuck. It's published now 👍

thmo commented

Currently the issue is quick-fixed by pinning regex to an older version, which is not applicable in certain environments, e.g., with modules installed via RPMs.

Wouldn't something like this fix the issue:

--- a/dateparser/languages/locale.py
+++ b/dateparser/languages/locale.py
@@ -169,7 +169,7 @@ class Locale:
             if normalize:
                 value = list(map(normalize_unicode, value))
             pattern = '|'.join(sorted(value, key=len, reverse=True))
-            pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
+            pattern = pattern.replace(r'\d+', r'?P<n>\d+')
             pattern = re.compile(r'^(?:{})$'.format(pattern), re.UNICODE | re.IGNORECASE)
             relative_dictionary[pattern] = key
         return relative_dictionary

Based on this comment. Note that I'm not sure this is correct or complete, but judging from a run of the test suite together with regex-2022.3.15, it seems to work (apart from some, in my opinion unrelated, failures that are also present with regex-2022.3.2).
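A rough, standalone sketch of what the patched lines do, using the standard-library `re` instead of regex. The entry in `value` is a hypothetical relative-date pattern assumed to resemble dateparser's data. Only one alternative is used because the standard-library `re` rejects a repeated group name.

```python
import re

# Hypothetical relative-date pattern, assumed to resemble dateparser's data.
value = [r'(\d+) days ago']

pattern = '|'.join(sorted(value, key=len, reverse=True))
# Plain string replacement: no replacement template is parsed, so the
# backslash in '\d+' needs no extra escaping.
pattern = pattern.replace(r'\d+', r'?P<n>\d+')
compiled = re.compile(r'^(?:{})$'.format(pattern), re.UNICODE | re.IGNORECASE)

match = compiled.match('7 days ago')
captured = match.group('n')
print(pattern, captured)
```

Because `str.replace` treats both arguments as literal text, the rewritten pattern does not depend on how any regex engine interprets replacement templates.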

Reopening until we fix it properly.

Independently arrived at the same solution as the PR, explanation for the bug here

Fine. Now expecting a new release on PyPI!