r-lib/rex

Case-insensitive regex?

Is there a way to specify a regex is case-insensitive in {rex}?

We are passing it to list.files(pattern=) so the normal arguments are not available -- the only approach would be to add (?i) AFAICT. Without it, the regex is substantially gnarlier:

rex(".", or(group(one_of("Rr"), or("", "html", "md", "nw", "rst", "tex", "txt")), "Qmd", "qmd"), end)
# vs (not exactly the same, but that's fine)
rex(".", ignore_case(or("r", "rhtml", "rmd", "qmd", "rnw", "rrst", "rtex", "rtxt")), end)
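For comparison, here is roughly what the two forms amount to as hand-written patterns. This is only a sketch: the strings are my approximation of the intent, not actual rex output, and the file names are made up. It shows where the two differ (mixed-case extensions such as .QMD or .rMD).

```r
# Hand-written approximations of the two patterns above (not rex output).
explicit <- "\\.(?:[Rr](?:|html|md|nw|rst|tex|txt)|Qmd|qmd)$"
caseless <- "(?i)\\.(?:r|rhtml|rmd|qmd|rnw|rrst|rtex|rtxt)$"

# Made-up file names for illustration.
files <- c("a.R", "b.Rmd", "c.qmd", "d.QMD", "e.py", "f.rMD")

grepl(explicit, files, perl = TRUE)
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
grepl(caseless, files, perl = TRUE)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
```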

I don't think there is, but I agree this would be nice to have.

The main thing I'm not aware of -- does (?i) apply to the whole regular expression, or to just the following "piece", or something else? What's the best syntax to adopt here?

It can be de-activated with (?-i), e.g.

grepl("(?i)a(?-i)A", c("aa", "aA", "Aa", "AA"))
# [1] FALSE  TRUE FALSE  TRUE

From ?regex:

Perl-like matching can work in several modes, set by the options (?i) (caseless, equivalent to Perl's /i), (?m) (multiline, equivalent to Perl's /m), (?s) (single line, so a dot matches all characters, even new lines: equivalent to Perl's /s) and (?x) (extended, whitespace data characters are ignored unless escaped and comments are allowed: equivalent to Perl's /x). These can be concatenated, so for example, (?im) sets caseless multiline matching. It is also possible to unset these options by preceding the letter with a hyphen, and to combine setting and unsetting such as (?im-sx). These settings can be applied within patterns, and then apply to the remainder of the pattern. Additional options not in Perl include (?U) to set ‘ungreedy’ mode (so matching is minimal unless ? is used as part of the repetition quantifier, when it is greedy). Initially none of these options are set.

It also applies locally within a group:

grepl("((?i)a)A", c("aa", "aA", "Aa", "AA"))
# [1] FALSE  TRUE FALSE  TRUE
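With perl = TRUE, PCRE also has a shorthand that sets the option and opens a non-capturing group in one step, (?i:...). A quick check (a sketch, using the same inputs as above) that it behaves like the grouped form:

```r
x <- c("aa", "aA", "Aa", "AA")

# Flag set inside a capturing group: it stops applying at the closing paren.
grouped <- grepl("((?i)a)A", x, perl = TRUE)

# Shorthand: non-capturing group with the flag scoped to it.
scoped <- grepl("(?i:a)A", x, perl = TRUE)

grouped
# [1] FALSE  TRUE FALSE  TRUE
identical(grouped, scoped)
# [1] TRUE
```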

It looks like we can chain the modes, but only with perl = TRUE (the default TRE engine rejects this pattern):

grepl("(?i)(?m)a.a(?-m)(?-i)", c("a\nA", "a\na", "a-A", "a-a"), perl = FALSE)
# Error in grepl("(?i)(?m)a.a(?-m)(?-i)", c("a\nA", "a\na", "a-A", "a-a"),  : 
#   invalid regular expression '(?i)(?m)a.a(?-m)(?-i)', reason 'Invalid regexp'
# In addition: Warning message:
# In grepl("(?i)(?m)a.a(?-m)(?-i)", c("a\nA", "a\na", "a-A", "a-a"),  :
#   TRE pattern compilation error 'Invalid regexp'
grepl("(?i)(?m)a.a(?-m)(?-i)", c("a\nA", "a\na", "a-A", "a-a"), perl = TRUE)
# [1] FALSE FALSE  TRUE  TRUE

So maybe the simplest implementation is (imitating rex:::group()):

ignore_case <- function(...) p("(?i)", p(escape_dots(...)), "(?-i)")
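One thing to watch with the set/unset form: the trailing (?-i) switches case-insensitivity off even if the enclosing pattern had already turned it on. Below is a stand-alone sketch using plain paste0() instead of rex internals; both helper names are hypothetical, and the scoped (?i:...) variant is just one possible alternative, not a settled design.

```r
# Hypothetical helpers operating on plain pattern strings (not rex API).
ignore_case_unset  <- function(pattern) paste0("(?i)", pattern, "(?-i)")
ignore_case_scoped <- function(pattern) paste0("(?i:", pattern, ")")

# On their own, both behave the same.
files <- c("a.R", "b.RMD", "c.py")
ext_unset  <- paste0("\\.", ignore_case_unset("(?:r|rmd|qmd)"), "$")
ext_scoped <- paste0("\\.", ignore_case_scoped("(?:r|rmd|qmd)"), "$")
grepl(ext_unset, files, perl = TRUE)
# [1]  TRUE  TRUE FALSE
grepl(ext_scoped, files, perl = TRUE)
# [1]  TRUE  TRUE FALSE

# Nested inside a pattern that is already caseless, they differ:
# the unset form makes the trailing "c" case-sensitive again.
nested_unset  <- paste0("(?i)a", ignore_case_unset("b"), "c")
nested_scoped <- paste0("(?i)a", ignore_case_scoped("b"), "c")
grepl(nested_unset, "ABC", perl = TRUE)
# [1] FALSE
grepl(nested_scoped, "ABC", perl = TRUE)
# [1] TRUE
```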

Alternatively (or perhaps additionally), we could unify an expression for modes:

regex_mode <- function(mode = c("ignore_case", "multiline", "single_line", "extended", "ungreedy"), ...) {
  mode <- unique(match.arg(mode, several.ok = TRUE))
  modes <- p(c(ignore_case = "i", multiline = "m", single_line = "s", extended = "x", ungreedy = "U")[mode])
  p("(?", modes, ")", p(escape_dots(...)), "(?-", modes, ")")
}

Or some other design that allows toggling modes on/off, like start_mode(c("ignore_case", "extended")) then end_mode("extended")...

Hmm, I see we have access to these through matches(options = ) already:

rex/R/match.R

Lines 145 to 151 in 7148a0c

option_map <- c(
  "insensitive" = "i",
  "multi-line" = "m",
  "single-line" = "s",
  "extended" = "x",
  "ungreedy" = "U"
)

So we just need an interface to apply this directly to the regex, since we won't always be executing with matches(). But we should be consistent with the existing interface.
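As a rough idea of what that could look like, here is a stand-alone sketch that reuses the option names from option_map and prepends the flags to a plain pattern string. The helper name apply_options() is hypothetical, and this is not necessarily how rex applies options internally.

```r
# Copied from the snippet above so the example is self-contained.
option_map <- c(
  "insensitive" = "i",
  "multi-line" = "m",
  "single-line" = "s",
  "extended" = "x",
  "ungreedy" = "U"
)

# Hypothetical helper: validate option names, then prefix "(?flags)".
apply_options <- function(pattern, options) {
  options <- match.arg(options, names(option_map), several.ok = TRUE)
  paste0("(?", paste(option_map[options], collapse = ""), ")", pattern)
}

pat <- apply_options("\\.(?:r|rmd|qmd)$", "insensitive")
pat
# [1] "(?i)\\.(?:r|rmd|qmd)$"
grepl(pat, c("a.R", "b.Qmd", "c.py"), perl = TRUE)
# [1]  TRUE  TRUE FALSE
```

Using the same option names would keep a pattern built this way consistent with what options = already accepts.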