xarray-contrib/cf-xarray

Can we switch `re` to `regex`?

kthyng opened this issue · 1 comment

I have a limited understanding of the difference between the two regular expression packages, but as of Python 3.11 re no longer accepts patterns in which "global flags" such as (?i) appear anywhere other than the start of the pattern, whereas regex does. I have been setting up my custom vocabularies so that such a flag can end up in the middle of a pattern, because sub-patterns can be linked together with |.

For example,

import cf_xarray as cfx
import xarray as xr

vocab = {"sea_ice_u": {"name": "(?i)^(?!.*(qc|status))(?=.*sea)(?=.*ice)(?=.*u)|(?i)^(?!.*(qc|status))(?=.*sea)(?=.*ice)(?=.*x)(?=.*vel)"}}
ds = xr.Dataset()
ds["sea_ice_velocity_x"] = [0,1,2]

with cfx.set_options(custom_criteria=vocab):
    ds.cf["sea_ice_u"]

This currently raises:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/kthyng/miniconda3/envs/omsa3/lib/python3.11/site-packages/cf_xarray/accessor.py", line 2034, in __getitem__
    return _getitem(self, key)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/kthyng/miniconda3/envs/omsa3/lib/python3.11/site-packages/cf_xarray/accessor.py", line 685, in _getitem
    names = _get_all(obj, k)
            ^^^^^^^^^^^^^^^^
  File "/Users/kthyng/miniconda3/envs/omsa3/lib/python3.11/site-packages/cf_xarray/accessor.py", line 385, in _get_all
    results = apply_mapper(all_mappers, obj, key, error=False, default=None)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kthyng/miniconda3/envs/omsa3/lib/python3.11/site-packages/cf_xarray/accessor.py", line 117, in apply_mapper
    results.append(_apply_single_mapper(mapper))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kthyng/miniconda3/envs/omsa3/lib/python3.11/site-packages/cf_xarray/accessor.py", line 101, in _apply_single_mapper
    results = mapper(obj, key)
              ^^^^^^^^^^^^^^^^
  File "/Users/kthyng/miniconda3/envs/omsa3/lib/python3.11/site-packages/cf_xarray/accessor.py", line 214, in _get_custom_criteria
    if re.match(patterns, obj[var].attrs.get(criterion, "")):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kthyng/miniconda3/envs/omsa3/lib/python3.11/re/__init__.py", line 166, in match
    return _compile(pattern, flags).match(string)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kthyng/miniconda3/envs/omsa3/lib/python3.11/re/__init__.py", line 294, in _compile
    p = _compiler.compile(pattern, flags)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kthyng/miniconda3/envs/omsa3/lib/python3.11/re/_compiler.py", line 743, in compile
    p = _parser.parse(p, flags)
        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kthyng/miniconda3/envs/omsa3/lib/python3.11/re/_parser.py", line 980, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kthyng/miniconda3/envs/omsa3/lib/python3.11/re/_parser.py", line 455, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kthyng/miniconda3/envs/omsa3/lib/python3.11/re/_parser.py", line 841, in _parse
    raise source.error('global flags not at the start '
re.error: global flags not at the start of the expression at position 48
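
The failure can be reproduced with re alone, without cf_xarray. The following is a minimal sketch: the helper name `compiles` and the shortened patterns are illustrative and not part of cf_xarray. On Python 3.11 and later the second (?i) raises re.error; on Python 3.6 through 3.10 the same pattern only emits a DeprecationWarning, which the sketch silences.

```python
import re
import sys
import warnings

# Shortened, illustrative version of the vocabulary pattern: two alternatives
# joined with "|", each carrying its own leading (?i) flag.
JOINED = "(?i)^(?=.*sea)(?=.*ice)(?=.*u)|(?i)^(?=.*sea)(?=.*ice)(?=.*vel)"

# The same alternatives with the flag written once, at the very start.
HOISTED = "(?i)(?:^(?=.*sea)(?=.*ice)(?=.*u)|^(?=.*sea)(?=.*ice)(?=.*vel))"


def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        with warnings.catch_warnings():
            # Python 3.6-3.10 only warn about mid-pattern global flags.
            warnings.simplefilter("ignore", DeprecationWarning)
            re.compile(pattern)
    except re.error:
        return False
    return True


# Prints the interpreter version and whether each pattern compiles.
print(sys.version_info[:2], compiles(JOINED), compiles(HOISTED))
```

The hoisted form compiles on every version, so the error is specifically about where the flag sits, not about the lookaheads.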

But if I replace re with regex (and do some renaming since the variable holding regular expressions in accessor.py is also called "regex") I get back:

<xarray.DataArray 'sea_ice_velocity_x' (sea_ice_velocity_x: 3)>
array([0, 1, 2])
Coordinates:
  * sea_ice_velocity_x  (sea_ice_velocity_x) int64 0 1 2

I suppose there is a reason that re doesn't allow this anymore but I would prefer to be able to do so! What do others think? @dcherian you might be the other person who has used custom vocabularies?
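
For comparison, a vocabulary can stay compatible with re by using scoped inline flags, (?i:...), which re has supported since Python 3.6. This is only a sketch of that alternative, not something tested inside cf_xarray; it assumes the pattern is passed straight to re.match, as the traceback above shows, and the names `ALTERNATIVES`, `PATTERN` and `name_matches` are hypothetical.

```python
import re

# Each alternative carries a scoped flag group (?i:...) instead of a global
# (?i), so the pieces can still be linked together with "|".
ALTERNATIVES = [
    r"(?i:^(?!.*(qc|status))(?=.*sea)(?=.*ice)(?=.*u))",
    r"(?i:^(?!.*(qc|status))(?=.*sea)(?=.*ice)(?=.*x)(?=.*vel))",
]
PATTERN = "|".join(ALTERNATIVES)


def name_matches(name):
    """Return True if the variable name satisfies any alternative."""
    return re.match(PATTERN, name) is not None


for name in ["sea_ice_velocity_x", "SEA_ICE_U", "sea_ice_velocity_x_qc"]:
    print(name, name_matches(name))
```

The first two names match and the "_qc" name is rejected by the negative lookahead. The trade-off is that every existing vocabulary entry would need rewriting, which switching to regex avoids.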

I don't know the differences, but since regex is backwards compatible with re, we could use it when it is installed and fall back to re otherwise.

So

try:
    from regex import match
except ImportError:
    from re import match

We can add regex to the optional-deps environment for testing.
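
A slightly fuller sketch of the optional import, under the assumption that the module is bound to an alias so it does not clash with the existing variable named "regex" in accessor.py. The names `_re_backend`, `HAVE_REGEX` and `matches` are hypothetical and not taken from cf_xarray.

```python
# Prefer the third-party regex package when it is installed; otherwise fall
# back to the standard library. Both expose match(pattern, string).
try:
    import regex as _re_backend
    HAVE_REGEX = True
except ImportError:
    import re as _re_backend
    HAVE_REGEX = False


def matches(pattern, string):
    """Return True if the pattern matches at the start of the string."""
    return _re_backend.match(pattern, string) is not None


print("using regex:", HAVE_REGEX)
print(matches("(?i)^(?=.*sea)(?=.*ice)", "Sea_Ice_Velocity_X"))
```

With this layout, patterns that put (?i) in the middle would still fail for users who do not have regex installed, so the docs would need to say that such vocabularies require the optional dependency.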

PR welcome!