tokenizer.tokenize issue in Python >=3.7
dgrant opened this issue · 0 comments
dgrant commented
There is an issue with tokenizer.tokenize in Python 3.7 and greater: TokenizerTest fails. I've extracted the smallest failing piece below to narrow down the issue.
Python 3.6.12 (same result on Python 2.7.18):
Python 3.6.12 (default, Dec 7 2020, 13:18:37)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.split(r'([,\(\)])?\s*', "2 tablespoons milk")
/home/david/.asdf/installs/python/3.6.12/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
return _compile(pattern, flags).split(string, maxsplit)
['2', None, 'tablespoons', None, 'milk']
Python 3.7.3
Python 3.7.3 (default, Jun 12 2020, 01:19:31)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.split(r'([,\(\)])?\s*', "2 tablespoons milk")
['', None, '2', None, '', None, 't', None, 'a', None, 'b', None, 'l', None, 'e', None, 's', None, 'p', None, 'o', None, 'o', None, 'n', None, 's', None, '', None, 'm', None, 'i', None, 'l', None, 'k', None, '']
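To see how this breaks tokenization, here is a minimal sketch of a tokenize function built on that regex (a hypothetical reconstruction for illustration, not necessarily the project's exact code), assuming it drops the empty strings and Nones that re.split produces:

```python
import re

def tokenize(s):
    # Hypothetical sketch of a tokenizer built on the problematic regex;
    # not necessarily the project's exact implementation.
    # Split on commas/parens (captured) plus surrounding whitespace,
    # then filter out the Nones and empty strings from re.split.
    return [t for t in re.split(r'([,\(\)])?\s*', s) if t]

print(tokenize("2 tablespoons milk"))
# Python 3.6: ['2', 'tablespoons', 'milk']
# Python 3.7: ['2', 't', 'a', 'b', 'l', 'e', 's', 'p', 'o', 'o', 'n', 's', 'm', 'i', 'l', 'k']
```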
So two things should be done here. First, the tests should be run on a newer Python (>=3.7); ideally, running them against several versions is always a good idea. Second, the regex itself needs to be fixed.
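For the first point, one way to exercise several interpreters (an assumption on my part; the project may prefer a different test runner) is a tox config along these lines:

```ini
# Hypothetical tox.ini sketch; the envlist and test command are assumptions,
# adjust them to whatever the project's test suite actually uses.
[tox]
envlist = py27, py36, py37, py38

[testenv]
commands = python -m unittest discover
```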
The FutureWarning there is actually useful: "FutureWarning: split() requires a non-empty pattern match." In Python 3.7 that warned-about change took effect, and re.split() now splits on empty matches of the pattern, which is why the string gets broken into individual characters. It looks like it can be fixed simply by using \s+ in the regex instead of \s*, so the pattern can never match an empty string.
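A quick check of the proposed fix on Python 3.7 (the character class is taken from the example above):

```python
import re

# \s+ instead of \s*: the pattern now always consumes at least one
# whitespace character, so it can never match the empty string.
fixed = r'([,\(\)])?\s+'

print(re.split(fixed, "2 tablespoons milk"))
# ['2', None, 'tablespoons', None, 'milk']  -- matches the 3.6 output

print(re.split(fixed, "2 tablespoons milk, chilled"))
# ['2', None, 'tablespoons', None, 'milk', ',', 'chilled']
```

One caveat worth noting: with \s+ the pattern only fires when whitespace is present, so punctuation glued directly to the next token (e.g. "(about") would no longer be split off; whether that matters depends on the inputs the tokenizer actually sees.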