Bad escape characters trigger an exception
Etirf opened this issue · 17 comments
Note: As a workaround for this issue, we have pinned regex, which makes Python 3.11 support either impossible or uncomfortable. The goal now is to remove that version pin on regex without making this issue resurface.
Hello everyone,
I tried parsing under Python 3.7.5 and 3.9:
dateparser.parse('12/12/12')
It also gives the same output for any "valid" input shown in the doc:
dateparser.parse('Fri, 12 Dec 2014 10:55:50')
dateparser.parse('22 Décembre 2010', date_formats=['%d %B %Y'])
...
Here's the error:
---------------------------------------------------------------------------
error Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 dateparser.parse("12/12/12")
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\conf.py:92, in apply_settings.<locals>.wrapper(*args, **kwargs)
89 if not isinstance(kwargs['settings'], Settings):
90 raise TypeError("settings can only be either dict or instance of Settings class")
---> 92 return f(*args, **kwargs)
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\__init__.py:61, in parse(date_string, date_formats, languages, locales, region, settings, detect_languages_function)
57 if languages or locales or region or detect_languages_function or not settings._default:
58 parser = DateDataParser(languages=languages, locales=locales,
59 region=region, settings=settings, detect_languages_function=detect_languages_function)
---> 61 data = parser.get_date_data(date_string, date_formats)
63 if data:
64 return data['date_obj']
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:428, in DateDataParser.get_date_data(self, date_string, date_formats)
425 date_string = sanitize_date(date_string)
427 for locale in self._get_applicable_locales(date_string):
--> 428 parsed_date = _DateLocaleParser.parse(
429 locale, date_string, date_formats, settings=self._settings)
430 if parsed_date:
431 parsed_date['locale'] = locale.shortname
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:178, in _DateLocaleParser.parse(cls, locale, date_string, date_formats, settings)
175 @classmethod
176 def parse(cls, locale, date_string, date_formats=None, settings=None):
177 instance = cls(locale, date_string, date_formats, settings)
--> 178 return instance._parse()
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:182, in _DateLocaleParser._parse(self)
180 def _parse(self):
181 for parser_name in self._settings.PARSERS:
--> 182 date_data = self._parsers[parser_name]()
183 if self._is_valid_date_data(date_data):
184 return date_data
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:196, in _DateLocaleParser._try_freshness_parser(self)
194 def _try_freshness_parser(self):
195 try:
--> 196 return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
197 except (OverflowError, ValueError):
198 return None
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:234, in _DateLocaleParser._get_translated_date(self)
232 def _get_translated_date(self):
233 if self._translated_date is None:
--> 234 self._translated_date = self.locale.translate(
235 self.date_string, keep_formatting=False, settings=self._settings)
236 return self._translated_date
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:131, in Locale.translate(self, date_string, keep_formatting, settings)
128 dictionary = self._get_dictionary(settings)
129 date_string_tokens = dictionary.split(date_string, keep_formatting)
--> 131 relative_translations = self._get_relative_translations(settings=settings)
133 for i, word in enumerate(date_string_tokens):
134 word = word.lower()
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:158, in Locale._get_relative_translations(self, settings)
155 if settings.NORMALIZE:
156 if self._normalized_relative_translations is None:
157 self._normalized_relative_translations = (
--> 158 self._generate_relative_translations(normalize=True))
159 return self._normalized_relative_translations
160 else:
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:172, in Locale._generate_relative_translations(self, normalize)
170 value = list(map(normalize_unicode, value))
171 pattern = '|'.join(sorted(value, key=len, reverse=True))
--> 172 pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
173 pattern = re.compile(r'^(?:{})$'.format(pattern), re.UNICODE | re.IGNORECASE)
174 relative_dictionary[pattern] = key
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\regex\regex.py:700, in _compile_replacement_helper(pattern, template)
695 break
696 if ch == "\\":
697 # '_compile_replacement' will return either an int group reference
698 # or a string literal. It returns items (plural) in order to handle
699 # a 2-character literal (an invalid escape sequence).
--> 700 is_group, items = _compile_replacement(source, pattern, is_unicode)
701 if is_group:
702 # It's a group, so first flush the literal.
703 if literal:
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\regex\_regex_core.py:1736, in _compile_replacement(source, pattern, is_unicode)
1733 if value is not None:
1734 return False, [value]
-> 1736 raise error("bad escape \\%s" % ch, source.string, source.pos)
1738 if isinstance(source.sep, bytes):
1739 octal_mask = 0xFF
error: bad escape \d at position 7
How to reproduce:
Env: windows 10
- Fresh install of python 3.7.5 or 3.9
- Make a simple python file including these 2 lines:
import dateparser
dateparser.parse("12/12/12")
I am seeing the exact same behavior with code that worked just 2 hours ago. This is on macOS. I tested with python 3.8.2, 3.8.5, and 3.10.2
Same here. Python 3.7.12, macOS.
Same here, Python 3.9-slim and 3.10-slim docker images, sample code:
from dateparser import parse
parse("7 days ago")
Output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.10/site-packages/dateparser/conf.py", line 92, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/dateparser/__init__.py", line 61, in parse
data = parser.get_date_data(date_string, date_formats)
File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 428, in get_date_data
parsed_date = _DateLocaleParser.parse(
File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 178, in parse
return instance._parse()
File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 182, in _parse
date_data = self._parsers[parser_name]()
File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 196, in _try_freshness_parser
return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
File "/usr/local/lib/python3.10/site-packages/dateparser/date.py", line 234, in _get_translated_date
self._translated_date = self.locale.translate(
File "/usr/local/lib/python3.10/site-packages/dateparser/languages/locale.py", line 131, in translate
relative_translations = self._get_relative_translations(settings=settings)
File "/usr/local/lib/python3.10/site-packages/dateparser/languages/locale.py", line 158, in _get_relative_translations
self._generate_relative_translations(normalize=True))
File "/usr/local/lib/python3.10/site-packages/dateparser/languages/locale.py", line 172, in _generate_relative_translations
pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
File "/usr/local/lib/python3.10/site-packages/regex/regex.py", line 700, in _compile_replacement_helper
is_group, items = _compile_replacement(source, pattern, is_unicode)
File "/usr/local/lib/python3.10/site-packages/regex/_regex_core.py", line 1736, in _compile_replacement
raise error("bad escape \\%s" % ch, source.string, source.pos)
regex._regex_core.error: bad escape \d at position 7
We were using dateparser==1.0.0; upgrading to dateparser==1.1.0 didn't solve the issue.
The dependency regex==2022.3.15 probably caused this; rolling back to regex==2022.1.18 may help.
Update: the cause is this commit: mrabarnett/mrab-regex@138970b
I can confirm that deploying regex==2022.1.18 instead (through conda in my case) makes the bug disappear.
Caused by a behaviour change introduced in mrabarnett/mrab-regex@138970b (released as regex v2022.3.15); installing any version before this (e.g. v2022.3.2) should fix it.
The change was to raise on invalid ASCII escape characters in pattern compiling and substitution. Not sure if it's a bug in dateparser or in regex.
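For illustration, the stricter behaviour can be reproduced without regex or dateparser at all, because the standard library re module already rejects unknown escapes in replacement templates on Python 3.7 and later. This is only a minimal sketch; `template_is_accepted` is a hypothetical helper name, not part of any library.

```python
import re


def template_is_accepted(template: str) -> bool:
    """Return True if `re` accepts `template` as a replacement template."""
    try:
        # The template is parsed as soon as it contains a backslash.
        re.sub(r"x", template, "x")
        return True
    except re.error:
        return False


# A bare \d is an unknown escape in a replacement template.
print(template_is_accepted(r"?P<n>\d+"))
# A doubled backslash stands for one literal backslash, so this is accepted.
print(template_is_accepted(r"?P<n>\\d+"))
```

The first template is the one dateparser passes to `DIGIT_GROUP_PATTERN.sub` in the traceback above.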
This will be a problem on all supported platforms and environments (Linux, MacOS, Windows; Python 3.6 to 3.10)
This makes CI/CD break when installing the latest version. Please update the PyPI package too, thanks a lot.
Hi. I also ran into the same problem (and thought it was a Mac M1 problem with the regex lib).
It turns out to be related to regex dropping support for Python versions before 3.6:
Since Python 3.6, the re module has been rejecting unknown escape sequences such as \q in patterns and escape sequences including \d in replacement templates. As the regex module no longer supports versions of Python <3.6, I've brought the regex module into line with re.
Your code should now read:
pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\\d+', pattern)
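A quick way to see what the doubled backslash does, sketched with the standard library re module. `DIGIT_GROUP` below is an assumed stand-in for dateparser's `DIGIT_GROUP_PATTERN` (taken here to match a literal `\d+`), and the sample pattern string is made up for the example.

```python
import re

# Assumption: stand-in for dateparser's DIGIT_GROUP_PATTERN.
# It matches the three literal characters backslash, "d", "+".
DIGIT_GROUP = re.compile(r"\\d\+")


def name_digit_groups(pattern: str) -> str:
    """Turn each `\\d+` in `pattern` into `?P<n>\\d+`."""
    # In the template, "\\" is an escaped backslash, so the output
    # contains a single literal backslash followed by "d+".
    return DIGIT_GROUP.sub(r"?P<n>\\d+", pattern)


sample = r"in (\d+) days?"  # hypothetical locale pattern
print(name_digit_groups(sample))  # in (?P<n>\d+) days?
```

Passing a function instead of a string as the replacement (for example `lambda m: r"?P<n>\d+"`) would also avoid template parsing entirely.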
More info in mrabarnett/mrab-regex/issues/459
Here is a problematic pattern but there may be more?
I can confirm that this issue is NOT specific to MacOS - our CI/CD uses Linux machines and was affected by this. My local machine, running Ubuntu, was also affected.
Explicitly pinning regex==2022.1.18 as suggested by @xiaopc fixed it for us.
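To check whether a given environment has picked up the stricter regex release, something like the following could help. It is a rough sketch that assumes regex keeps its date-style `YYYY.M.D` version numbers and, per the comments above, that 2022.3.15 is the first release with the new behaviour; `parse_release` and `is_strict_regex` are hypothetical names.

```python
from importlib.metadata import PackageNotFoundError, version

# Per this thread, the behaviour change shipped in regex 2022.3.15.
FIRST_STRICT_RELEASE = (2022, 3, 15)


def parse_release(text: str) -> tuple:
    """Parse a version such as '2022.3.15' into (2022, 3, 15)."""
    return tuple(int(part) for part in text.split(".")[:3])


def is_strict_regex(version_text: str) -> bool:
    """True if this regex version rejects bad escapes in templates."""
    return parse_release(version_text) >= FIRST_STRICT_RELEASE


try:
    installed = version("regex")  # requires Python 3.8+
    print(installed, "affected" if is_strict_regex(installed) else "not affected")
except PackageNotFoundError:
    print("regex is not installed")
```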
Thanks for the fix and for writing the library in the first place. This seems to me to be one of the best date parsing libraries; we use it for a lot of data imports. Hoping for a pip release soon as well. Keep up the good work 👍
Many thanks for the thorough investigation!
For now I'll make a quick fix by pinning the regex version, but in the long run we should follow @tducret's suggestion (#1045 (comment)) and reform the regexes.
If anyone's up for a PR with the fix, please go ahead!
Is it possible to push version 1.1.1 to PyPI, please?
Thank you for raising that, @rerowep. It seems like the PyPI publish action got stuck. It's published now 👍
Currently the issue is quick-fixed by pinning regex to an older version, which is not applicable in certain environments, e.g., with modules installed via RPMs.
Wouldn't something like this fix the issue:
--- a/dateparser/languages/locale.py
+++ b/dateparser/languages/locale.py
@@ -169,7 +169,7 @@ class Locale:
             if normalize:
                 value = list(map(normalize_unicode, value))
             pattern = '|'.join(sorted(value, key=len, reverse=True))
-            pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
+            pattern = pattern.replace(r'\d+', r'?P<n>\d+')
             pattern = re.compile(r'^(?:{})$'.format(pattern), re.UNICODE | re.IGNORECASE)
             relative_dictionary[pattern] = key
         return relative_dictionary
Based on this comment. Note that I'm not sure this is correct or complete, but judging from a run of the test suite together with regex-2022.3.15, it seems to work (besides some IMHO unrelated things, which are also broken with regex-2022.3.2).
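As a sanity check of the idea in the diff, here is a small standalone sketch using the standard library re module rather than dateparser itself. `str.replace` does no escape processing, so the `\d` in the replacement text is copied through verbatim. The sample pattern and the name `add_group_name` are made up for the example.

```python
import re


def add_group_name(pattern: str) -> str:
    """Insert `?P<n>` before each literal `\\d+` using plain string replacement."""
    # str.replace treats both arguments as plain text, so no template
    # parsing happens and a bare \d cannot trigger "bad escape".
    return pattern.replace(r"\d+", r"?P<n>\d+")


sample = r"in (\d+) days?"  # hypothetical locale pattern
named = add_group_name(sample)

# Same wrapping and flags as in the surrounding dateparser code.
compiled = re.compile(r"^(?:{})$".format(named), re.UNICODE | re.IGNORECASE)
match = compiled.match("in 7 days")
print(named, match.group("n"))
```

The sample uses a single digit group because the standard re module rejects a repeated group name within one pattern.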
Reopening until we fix it properly.
Independently arrived at the same solution as the PR; explanation for the bug here
Fine. Now expecting a new release on PyPI!