doc clarification: confusing match behavior for non-existent ASCII character classes
dawnofmidnight opened this issue · 1 comments
Crate version: 1.11.0
Example code: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=c4b4cfe18c2e6413444e53315de33b27 (used for snippets below and extra checks)
The behavior of the crate when trying to use the ASCII character class syntax [[:foo:]] with invalid character classes is somewhat confusing. A friend was trying to use [[:XID_Start:]] to check whether _ (underscore/low line) was included in the XID_Start character class (it's not), and was confused when it returned true.
let expr = regex::Regex::new(r"[[:XID_Start:]]").unwrap();
dbg!(expr.is_match("_")); // true
The correct syntax, \p{XID_Start}, does work correctly:
let correct = regex::Regex::new(r"\p{XID_Start}").unwrap();
dbg!(correct.is_match("a")); // true
dbg!(correct.is_match("1")); // false
dbg!(correct.is_match("_")); // false
It seems that when the name is not a valid ASCII character class (regex § ASCII character classes), the crate falls back to matching any character that appears between the brackets:
dbg!(expr.is_match(":")); // true
dbg!(expr.is_match("X")); // true
dbg!(expr.is_match("x")); // false
dbg!(expr.is_match("a")); // true
dbg!(expr.is_match("b")); // false
dbg!(expr.is_match("[")); // false
dbg!(expr.is_match("]")); // false
I'm not entirely sure what regex is actually interpreting this sequence as. Assuming the behavior is intentional, I think it is worth documenting in the aforementioned section on ASCII character classes, since it is not immediately intuitive.
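One reading that fits every result above is that the unrecognized [:XID_Start:] is treated as an ordinary bracketed set of the literal characters `:`, `X`, `I`, `D`, `_`, `S`, `t`, `a` and `r`. The sketch below models that reading with the standard library only. `literal_set_matches` is a hypothetical helper, not part of the regex crate, and the model is an assumption drawn from the observed output, not a description of the crate's parser.

```rust
/// Hypothetical model (not the regex crate's API): treats the text between
/// the inner brackets as a plain set of literal characters and reports
/// whether any character of `haystack` belongs to that set, mirroring the
/// unanchored semantics of `Regex::is_match`.
fn literal_set_matches(inner: &str, haystack: &str) -> bool {
    haystack.chars().any(|c| inner.contains(c))
}

/// The characters written inside `[[:XID_Start:]]`, without the brackets.
const INNER: &str = ":XID_Start:";
```

Under this model, `literal_set_matches(INNER, "_")` is true and `literal_set_matches(INNER, "x")` is false, which agrees with the `dbg!` results listed above.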
Yes, the behavior is unfortunate but intentional, for compatibility with how other regex engines work. In retrospect, I would rather have been a bit more strict here and produced errors for unrecognized classes.
I agree that adding a note to the docs about this would be a good idea.
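Short of stricter parsing in the crate itself, a caller could screen patterns for unrecognized class names before compiling them. A minimal sketch, using only the standard library: `unknown_ascii_classes` is a hypothetical helper, not something the regex crate provides, and `ASCII_CLASSES` lists the names from the crate's documented ASCII character classes. The scan is deliberately naive and assumes the pattern contains no escaped `[:` or `:]` sequences.

```rust
/// ASCII class names listed in the regex crate's syntax documentation.
const ASCII_CLASSES: [&str; 14] = [
    "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
    "lower", "print", "punct", "space", "upper", "word", "xdigit",
];

/// Hypothetical lint (not part of the regex crate): returns every name used
/// as `[:name:]` in `pattern` that is not a recognized ASCII class.
/// Naive by design: it does not account for escapes such as `\[`.
fn unknown_ascii_classes(pattern: &str) -> Vec<String> {
    let mut unknown = Vec::new();
    let mut rest = pattern;
    while let Some(start) = rest.find("[:") {
        let after = &rest[start + 2..];
        match after.find(":]") {
            Some(end) => {
                let raw = &after[..end];
                // Drop the optional negation marker, as in `[:^alpha:]`.
                let name = raw.strip_prefix('^').unwrap_or(raw);
                if !ASCII_CLASSES.contains(&name) {
                    unknown.push(name.to_string());
                }
                rest = &after[end + 2..];
            }
            // No closing `:]`, so nothing else can look like a class.
            None => break,
        }
    }
    unknown
}
```

Calling `unknown_ascii_classes(r"[[:XID_Start:]]")` reports `XID_Start`, while a pattern such as `[[:alpha:][:^digit:]]` or `\p{XID_Start}` yields an empty list.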