mtlynch/ingredient-phrase-tagger

tokenizer.tokenize issue in Python >=3.7

dgrant opened this issue · 0 comments

There is an issue with tokenizer.tokenize on Python 3.7 and greater (Python 3.6 already emits a FutureWarning about the upcoming behavior change).

TokenizerTest fails. I've extracted the smallest reproduction below to narrow down the issue.

Python 3.6.12 (Python 2.7.18 returns the same tokens, without the warning):

Python 3.6.12 (default, Dec  7 2020, 13:18:37) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.split(r'([,\(\)])?\s*', "2 tablespoons milk")
/home/david/.asdf/installs/python/3.6.12/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)
['2', None, 'tablespoons', None, 'milk']

Python 3.7.3:

Python 3.7.3 (default, Jun 12 2020, 01:19:31) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.split(r'([,\(\)])?\s*', "2 tablespoons milk")
['', None, '2', None, '', None, 't', None, 'a', None, 'b', None, 'l', None, 'e', None, 's', None, 'p', None, 'o', None, 'o', None, 'n', None, 's', None, '', None, 'm', None, 'i', None, 'l', None, 'k', None, '']

Two things follow from this. First, the test suite should be run against a current Python (>=3.7), ideally across several versions. Second, the tokenizer itself needs to be fixed.

The FutureWarning is actually pointing at the fix: "split() requires a non-empty pattern match." Since Python 3.7, re.split also splits on empty matches of the pattern, and `([,\(\)])?\s*` can match the empty string (the group is optional and `\s*` matches zero characters), which is why every character gets separated. It looks like it can be fixed simply by using `\s+` in the regex instead of `\s*`, so the pattern can never match an empty string.
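A minimal sketch of the proposed fix (the function name `split_phrase` is hypothetical; the repo's actual tokenizer does more than this, but the regex change is the relevant part):

```python
import re

# Proposed fix: require at least one whitespace character (\s+ instead of
# \s*), so the pattern can never match the empty string. The optional group
# still captures separating punctuation when it is followed by whitespace.
_SPLIT_RE = re.compile(r'([,\(\)])?\s+')

def split_phrase(phrase):
    """Split an ingredient phrase on whitespace, keeping captured punctuation."""
    tokens = _SPLIT_RE.split(phrase)
    # re.split yields None for the optional group when it did not participate
    # in a match; drop those along with any empty strings.
    return [t for t in tokens if t]

print(split_phrase("2 tablespoons milk"))
# ['2', 'tablespoons', 'milk'] on both Python 2.7 and 3.7+, with no warning
```

Because `\s+` requires at least one whitespace character, this gives identical results on old and new Pythons, and the FutureWarning disappears on 3.6.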