Case-insensitive regex?
Is there a way to specify that a regex is case-insensitive in {rex}?
We are passing it to list.files(pattern=), so the normal arguments are not available -- the only approach would be to add (?i), AFAICT. Without it, the regex is substantially gnarlier:
rex(".", or(group(one_of("Rr"), or("", "html", "md", "nw", "rst", "tex", "txt")), "Qmd", "qmd"), end)
# vs (not exactly the same, but that's fine)
rex(".", ignore_case(or("r", "rhtml", "rmd", "qmd", "rnw", "rrst", "rtex", "rtxt")), end)
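For comparison, the inline-flag approach mentioned above can be sketched without {rex} at all. This is only an illustration, not a proposed API: the pattern is written by hand, the file names are invented, and it assumes the default (non-Perl) regex engine accepts a leading (?i).

```r
# Hand-written pattern with a leading inline (?i) flag (illustrative only).
pattern <- "(?i)[.](r|rhtml|rmd|qmd|rnw|rrst|rtex|rtxt)$"

# Hypothetical file names, invented for this example.
files <- c("analysis.R", "report.Rmd", "notes.qmd", "data.txt", "slides.QMD")

# grepl() with its default (non-Perl) engine.
is_r_source <- grepl(pattern, files)
files[is_r_source]
```

The same string could be passed as list.files(pattern = pattern), which is the use case motivating the request.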
I don't think there is, but I agree this would be nice to have.
The main thing I'm not aware of -- does (?i) apply to the whole regular expression, or to just the following "piece", or something else? What's the best syntax to adopt here?
It can be de-activated with (?-i), e.g.
grepl("(?i)a(?-i)A", c("aa", "aA", "Aa", "AA"))
# [1] FALSE TRUE FALSE TRUE
From ?regex:
Perl-like matching can work in several modes, set by the options (?i) (caseless, equivalent to Perl's /i), (?m) (multiline, equivalent to Perl's /m), (?s) (single line, so a dot matches all characters, even new lines: equivalent to Perl's /s) and (?x) (extended, whitespace data characters are ignored unless escaped and comments are allowed: equivalent to Perl's /x). These can be concatenated, so for example, (?im) sets caseless multiline matching. It is also possible to unset these options by preceding the letter with a hyphen, and to combine setting and unsetting such as (?im-sx). These settings can be applied within patterns, and then apply to the remainder of the pattern. Additional options not in Perl include (?U) to set ‘ungreedy’ mode (so matching is minimal unless ? is used as part of the repetition quantifier, when it is greedy). Initially none of these options are set.
It also applies locally within a group:
grepl("((?i)a)A", c("aa", "aA", "Aa", "AA"))
# [1] FALSE TRUE FALSE TRUE
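The two behaviours shown so far (toggling with (?i)...(?-i), and scoping inside a group) can be compared side by side with a third spelling, the scoped form (?i:...). This is a small sketch, assuming perl = TRUE so that PCRE syntax applies throughout. The variable names are only for illustration.

```r
inputs <- c("aa", "aA", "Aa", "AA")

patterns <- c(
  toggle = "(?i)a(?-i)A",   # switch on, then switch off explicitly
  group  = "((?i)a)A",      # flag set inside a capturing group
  scoped = "(?i:a)A"        # flag attached to a non-capturing group
)

# One row per pattern, one column per input string.
results <- t(sapply(patterns, grepl, x = inputs, perl = TRUE))
colnames(results) <- inputs
results
```

All three rows should agree: only the strings ending in an uppercase "A" match.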
It looks like we can chain the modes, but only with perl=TRUE:
grepl("(?i)(?m)a.a(?-m)(?-i)", c("a\nA", "a\na", "a-A", "a-a"), perl = FALSE)
# Error in grepl("(?i)(?m)a.a(?-m)(?-i)", c("a\nA", "a\na", "a-A", "a-a"), :
# invalid regular expression '(?i)(?m)a.a(?-m)(?-i)', reason 'Invalid regexp'
# In addition: Warning message:
# In grepl("(?i)(?m)a.a(?-m)(?-i)", c("a\nA", "a\na", "a-A", "a-a"), :
# TRE pattern compilation error 'Invalid regexp'
grepl("(?i)(?m)a.a(?-m)(?-i)", c("a\nA", "a\na", "a-A", "a-a"), perl = TRUE)
# [1] FALSE FALSE TRUE TRUE
So maybe the simplest implementation is (imitating rex:::group()):
ignore_case <- function(...) p("(?i)", p(escape_dots(...)), "(?-i)")
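One thing that may be worth checking with a toggle-based version is nesting: a trailing (?-i) from an inner call switches case-insensitivity off for the rest of the enclosing call as well. The sketch below uses plain paste0() on strings assumed to be already escaped, rather than rex's internal p() and escape_dots(). Both function names are hypothetical. It compares the toggle form with a scoped (?i:...) form under perl = TRUE.

```r
# Stand-ins for the proposal, operating on regex fragments that are
# assumed to be valid and already escaped (no rex internals used).
ignore_case_toggle <- function(...) paste0("(?i)", paste0(..., collapse = ""), "(?-i)")
ignore_case_scoped <- function(...) paste0("(?i:", paste0(..., collapse = ""), ")")

nested_toggle <- ignore_case_toggle("a", ignore_case_toggle("b"), "c")
nested_scoped <- ignore_case_scoped("a", ignore_case_scoped("b"), "c")

nested_toggle  # "(?i)a(?i)b(?-i)c(?-i)"
nested_scoped  # "(?i:a(?i:b)c)"

# In the toggle form the inner (?-i) leaves "c" case-sensitive.
grepl(nested_toggle, "ABC", perl = TRUE)  # FALSE
grepl(nested_scoped, "ABC", perl = TRUE)  # TRUE
```

Whether the scoped form is acceptable depends on which engines need to be supported; this sketch only exercises PCRE.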
Alternatively (or perhaps additionally), we could unify an expression for modes:
regex_mode <- function(mode = c("ignore_case", "multiline", "single_line", "extended", "ungreedy"), ...) {
  mode <- unique(match.arg(mode, several.ok = TRUE))
  modes <- p(c(ignore_case = "i", multiline = "m", single_line = "s", extended = "x", ungreedy = "U")[mode])
  p("(?", modes, ")", p(escape_dots(...)), "(?-", modes, ")")
}
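For a feel of what regex_mode() would produce, here is a standalone approximation that swaps rex's internal p() and escape_dots() for paste0() and takes strings assumed to be already escaped. The name regex_mode_sketch is hypothetical. Unlike the version above, it requires mode to be given explicitly, since with several.ok = TRUE the default would, as far as I can tell, select every mode at once.

```r
regex_mode_sketch <- function(mode, ...) {
  flags <- c(ignore_case = "i", multiline = "m", single_line = "s",
             extended = "x", ungreedy = "U")
  # Validate the requested modes against the known flag names.
  mode <- unique(match.arg(mode, names(flags), several.ok = TRUE))
  modes <- paste0(flags[mode], collapse = "")
  # Wrap the fragments in a combined "set" and "unset" pair.
  paste0("(?", modes, ")", paste0(..., collapse = ""), "(?-", modes, ")")
}

pattern <- regex_mode_sketch(c("ignore_case", "single_line"), "a.a")
pattern  # "(?is)a.a(?-is)"

# With single_line set, "." also matches the newline.
grepl(pattern, c("a\nA", "a-a", "b"), perl = TRUE)  # TRUE TRUE FALSE
```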
Or some other design that allows toggling modes on/off, like start_mode(c("ignore_case", "extended")) then end_mode("extended")...
Hmm, I see we have access to these through matches(options = ) already:
Lines 145 to 151 in 7148a0c
So we just need an interface to apply this directly to the regex, since we won't always be executing with matches(). But we should be consistent with the existing interface.