pydata/numexpr

[BUG]: Sanitizing regex does not exclude string literals

taldcroft opened this issue · 3 comments

4b2d89c introduces a regression when an expression includes a string literal containing any of the newly forbidden characters. This breaks our production code when we upgrade numexpr to 2.8.7.

Example:

>>> import numexpr as ne
>>> ne.__version__
'2.8.7'
>>> import numpy as np

>>> x = np.array(['a', 'b'], dtype=bytes)
>>> ne.evaluate("x == 'b'")
array([False,  True])

>>> ne.evaluate("x == 'b:'")
Traceback (most recent call last):
  Cell In[6], line 1
    ne.evaluate("x == 'b:'")
  File ~/miniconda3/envs/numexpr/lib/python3.10/site-packages/numexpr/necompiler.py:975 in evaluate
    raise e
  File ~/miniconda3/envs/numexpr/lib/python3.10/site-packages/numexpr/necompiler.py:872 in validate
    _names_cache[expr_key] = getExprNames(ex, context, sanitize=sanitize)
  File ~/miniconda3/envs/numexpr/lib/python3.10/site-packages/numexpr/necompiler.py:721 in getExprNames
    ex = stringToExpression(text, {}, context, sanitize)
  File ~/miniconda3/envs/numexpr/lib/python3.10/site-packages/numexpr/necompiler.py:281 in stringToExpression
    raise ValueError(f'Expression {s} has forbidden control characters.')
ValueError: Expression x == 'b:' has forbidden control characters.

This could be fixed by first replacing the content within quotes before trying to match the blacklist. I will fix this and add some tests.
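
For illustration, the idea of blanking out quoted content before the blacklist check could be sketched roughly as below. This is only a sketch, not the actual numexpr code: `FORBIDDEN_RE`, `STRING_LITERAL_RE`, `strip_string_literals` and `check_expression` are hypothetical names, and the forbidden pattern is a simplified stand-in for whatever the library really uses.

```python
import re

# Hypothetical, simplified blacklist for illustration only;
# this is NOT numexpr's actual sanitizing pattern.
FORBIDDEN_RE = re.compile(r"[;:\[]|__")

# Matches a single- or double-quoted literal, allowing backslash escapes.
STRING_LITERAL_RE = re.compile(
    r"'(?:\\.|[^'\\])*'"
    r'|"(?:\\.|[^"\\])*"'
)


def strip_string_literals(expr):
    """Replace every quoted literal with an empty literal ('')."""
    return STRING_LITERAL_RE.sub("''", expr)


def check_expression(expr):
    """Raise ValueError if forbidden characters appear outside string literals."""
    if FORBIDDEN_RE.search(strip_string_literals(expr)):
        raise ValueError(f"Expression {expr} has forbidden control characters.")
    # The original expression (literals intact) is what gets evaluated.
    return expr


check_expression("x == 'b:'")      # passes: the colon is inside a literal
try:
    check_expression("x[0] == 'b'")  # rejected: '[' is outside any literal
except ValueError as err:
    print(err)
```

Only the copy used for the check is modified; the expression passed on for evaluation keeps its string literals unchanged.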

Thanks, looking forward to the next release! Looks like this can be closed now?