at_most generates invalid regex
DarwinAwardWinner opened this issue · 4 comments
It seems that the at_most
quantified in rex produces an invalid regex that causes a U_REGEX_BAD_INTERVAL
error when used. The fix seems to be to use a zero for the minimum quantity, as shown in good_regex
.
library(rex)
library(stringr)
#>
#> Attaching package: 'stringr'
#> The following object is masked from 'package:rex':
#>
#> regex
bad_regex <- rex(start, at_most(any, 40), or(letters))
good_regex <- '^(?:.){0,40}(?:[[:alpha:]]+)'
x <- "1234567890123456789012345678901234567890abc"
str_detect(x, good_regex)
#> [1] TRUE
str_detect(x, bad_regex)
#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`^(?:.){,40}(?:[[:alpha:]]+)`)
Created on 2021-03-12 by the reprex package (v1.0.0)
stringr and stringi use ICU regular expressions, the rex package produces PCRE, as are used by the base R functions with perl = TRUE
, e.g. regexpr(perl = TRUE)
, the syntax is not the same.
Oh that's right, they're so similar I forgot there are 2 different syntaxes. This is one of the first discrepancies I've run into, I think. For what it's worth, I think including the zero for at_most
as in good_regex
above is still correct in PCRE while also being correct in ICU syntax as well, so if you are open to such changes, I'd like to request that this one be considered. However, if you don't want to encourage people to further confuse the two, I'd understand and respect that choice.
Edit: If you are willing to consider such a change, I'd be happy to attempt the fix myself and submit a pull request.
I'd be open to a PR that makes the generated regular expressions compatible with both versions. Presumedly it's as simple as using "0"
instead of ""
here:
https://github.com/kevinushey/rex/blob/7d174c0d2045162aae7ffc4be75e244a22add73a/R/counts.R#L36-L39
That said, maybe I'm misunderstanding since the current generated regular expression in this case is not handled with grep(perl = TRUE)
; e.g.
> pattern <- rex(at_most(any, 40))
> pattern
(?:.){,40}
> grep(pattern, "abc")
[1] 1
> grep(pattern, "abc", perl = TRUE)
integer(0)
Indeed, it looks like the current implementation is actually wrong in base R functions when perl = TRUE
. So it appears this does need to be fixed. I believe the fix suggested by @kevinushey is the correct one, since between(..., 0, 40)
gives the correct result.
library(rex)
library(stringr)
#>
#> Attaching package: 'stringr'
#> The following object is masked from 'package:rex':
#>
#> regex
x <- c(
"abc",
"12345678901234567890abc",
"1234567890123456789012345678901234567890abc",
"123456789012345678901234567890123456789012345678901234567890abc",
"1234567890123456789012345678901234567890"
)
good_regex <- rex(start, between(any, 0, 40), or(letters))
bad_regex <- rex(start, at_most(any, 40), or(letters))
good_regex
#> ^(?:.){0,40}(?:[[:alpha:]]+)
bad_regex
#> ^(?:.){,40}(?:[[:alpha:]]+)
str_detect(x, good_regex)
#> [1] TRUE TRUE TRUE FALSE FALSE
str_detect(x, bad_regex)
#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`^(?:.){,40}(?:[[:alpha:]]+)`)
grepl(good_regex, x, perl = FALSE)
#> [1] TRUE TRUE TRUE FALSE FALSE
grepl(bad_regex, x, perl = FALSE)
#> [1] TRUE TRUE TRUE FALSE FALSE
grepl(good_regex, x, perl = TRUE)
#> [1] TRUE TRUE TRUE FALSE FALSE
grepl(bad_regex, x, perl = TRUE)
#> [1] FALSE FALSE FALSE FALSE FALSE
Created on 2021-03-12 by the reprex package (v1.0.0)