astral-sh/ruff

[`ruff`] Expand `unnecessary-regular-expression` with `re.compile` (RUF055)

Opened this issue · 5 comments

Often users use re.compile to compile a regex pattern. The returned regular expression object can be then used for the methods already caught by this rule. This is more efficient when the same pattern is reused.

Detecting these cases can be done with the same logic as in RUF055. Providing a fix involves more work, as the compiled pattern can be dynamically passed. Having the detection without the fix is already valuable. Alternatively, this could be a separate rule.

 re.compile(r"^tinyint", re.IGNORECASE)

without regex:

def tinyint(s: str) -> bool:
    return s.lower().startswith("tinyint")

Examples:

cc: @ntBre

I'm not sure if I fully understand the proposal. Is it to detect any usage of re.compile(r"^tinyint", re.IGNORECASE) or only usages in boolean positions?

I'm asking because I'm not sure if the regex patterns can easily be replaced in the examples you linked because the compiled expressions are passed to some generic testing function.

ntBre commented

Hmm, I thought this sounded straightforward at first (just catch re.compile with a literal non-meta-character pattern), but I think you really have to know where it's used to see if this is a problem. If any of the Match APIs are used, for example, the results will still be different from any plain str version.

For example, this could trigger the rule:

pat = re.compile("xyz")
if pat.match(s):
    pass

but we'd want something like this not to:

pat = re.compile("xyz")
m = pat.match(s)
# use m ...

Basically I think we'd need to extend #14679 to resolve string literals through a re.compile call.

Indeed, the examples above are not as clear.

A true positive example that would be great to be able to catch: https://github.com/great-expectations/great_expectations/blob/develop/ci/checks/check_only_name_tag_snippets.py#L48

A true positive example that would be great to be able to catch: great-expectations/great_expectations@develop/ci/checks/check_only_name_tag_snippets.py#L48

I would say even that is not entirely clear-cut: because it's a global variable, it could be read from another module, and you don't know how that other module might use the re.Pattern object

Good point. Even though this example should use string operations instead of regexes, we can't be sure without analysing the other files.
An example where the compiled regex is in the function scope:
https://github.com/mit-biomimetics/Cheetah-Software/blob/master/scripts/lcm-log2smat/python/lcmlog2smat/scan_for_lcmtypes.py#L12

Some obvious cases:

https://github.com/EmpireProject/Empire/blob/master/data/agent/agent.py#L927
https://github.com/johnlane/abcde/blob/master/examples/abcde.py#L96

After looking at more examples, I'm inclined to think that a dedicated rule with warning severity (#1256) that only considers the re.compile with a literal non-meta-character pattern would better suit this proposal, rather than extending RUF055.