r-lib/rex

at_most generates invalid regex

DarwinAwardWinner opened this issue · 4 comments

It seems that the at_most quantified in rex produces an invalid regex that causes a U_REGEX_BAD_INTERVAL error when used. The fix seems to be to use a zero for the minimum quantity, as shown in good_regex.

library(rex)
library(stringr)
#> 
#> Attaching package: 'stringr'
#> The following object is masked from 'package:rex':
#> 
#>     regex
bad_regex <- rex(start, at_most(any, 40), or(letters))
good_regex <- '^(?:.){0,40}(?:[[:alpha:]]+)'
x <- "1234567890123456789012345678901234567890abc"
str_detect(x, good_regex)
#> [1] TRUE
str_detect(x, bad_regex)
#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`^(?:.){,40}(?:[[:alpha:]]+)`)

Created on 2021-03-12 by the reprex package (v1.0.0)

stringr and stringi use ICU regular expressions, the rex package produces PCRE, as are used by the base R functions with perl = TRUE, e.g. regexpr(perl = TRUE), the syntax is not the same.

Oh that's right, they're so similar I forgot there are 2 different syntaxes. This is one of the first discrepancies I've run into, I think. For what it's worth, I think including the zero for at_most as in good_regex above is still correct in PCRE while also being correct in ICU syntax as well, so if you are open to such changes, I'd like to request that this one be considered. However, if you don't want to encourage people to further confuse the two, I'd understand and respect that choice.

Edit: If you are willing to consider such a change, I'd be happy to attempt the fix myself and submit a pull request.

I'd be open to a PR that makes the generated regular expressions compatible with both versions. Presumedly it's as simple as using "0" instead of "" here:

https://github.com/kevinushey/rex/blob/7d174c0d2045162aae7ffc4be75e244a22add73a/R/counts.R#L36-L39

That said, maybe I'm misunderstanding since the current generated regular expression in this case is not handled with grep(perl = TRUE); e.g.

> pattern <- rex(at_most(any, 40))
> pattern
(?:.){,40}
> grep(pattern, "abc")
[1] 1
> grep(pattern, "abc", perl = TRUE)
integer(0)

Indeed, it looks like the current implementation is actually wrong in base R functions when perl = TRUE. So it appears this does need to be fixed. I believe the fix suggested by @kevinushey is the correct one, since between(..., 0, 40) gives the correct result.

library(rex)
library(stringr)
#> 
#> Attaching package: 'stringr'
#> The following object is masked from 'package:rex':
#> 
#>     regex

x <- c(
  "abc",
  "12345678901234567890abc",
  "1234567890123456789012345678901234567890abc",
  "123456789012345678901234567890123456789012345678901234567890abc",
  "1234567890123456789012345678901234567890"
)
good_regex <- rex(start, between(any, 0, 40), or(letters))
bad_regex <- rex(start, at_most(any, 40), or(letters))
good_regex
#> ^(?:.){0,40}(?:[[:alpha:]]+)
bad_regex
#> ^(?:.){,40}(?:[[:alpha:]]+)
str_detect(x, good_regex)
#> [1]  TRUE  TRUE  TRUE FALSE FALSE
str_detect(x, bad_regex)
#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`^(?:.){,40}(?:[[:alpha:]]+)`)
grepl(good_regex, x, perl = FALSE)
#> [1]  TRUE  TRUE  TRUE FALSE FALSE
grepl(bad_regex, x, perl = FALSE)
#> [1]  TRUE  TRUE  TRUE FALSE FALSE
grepl(good_regex, x, perl = TRUE)
#> [1]  TRUE  TRUE  TRUE FALSE FALSE
grepl(bad_regex, x, perl = TRUE)
#> [1] FALSE FALSE FALSE FALSE FALSE

Created on 2021-03-12 by the reprex package (v1.0.0)