regexp: confusing behavior on invalid utf-8 sequences
dvyukov opened this issue · 2 comments
dvyukov commented
The following program:
package main

import "regexp"

func main() {
	re := regexp.MustCompile(".")
	println(re.MatchString("\xd1"))
	println(re.MatchString("\xd1\x84"))
	println(re.MatchString("\xd1\xd1"))
	re = regexp.MustCompile("..")
	println(re.MatchString("\xd1"))
	println(re.MatchString("\xd1\x84"))
	println(re.MatchString("\xd1\xd1"))
}
prints:
true
true
true
false
false
true
While the following C++ program:
#include <stdio.h>
#include <re2/re2.h>

int main() {
  RE2 re1(".");
  printf("%d\n", RE2::PartialMatch("\xd1", re1));
  printf("%d\n", RE2::PartialMatch("\xd1\x84", re1));
  printf("%d\n", RE2::PartialMatch("\xd1\xd1", re1));
  RE2 re2(".");
  printf("%d\n", RE2::PartialMatch("\xd1", re2));
  printf("%d\n", RE2::PartialMatch("\xd1\x84", re2));
  printf("%d\n", RE2::PartialMatch("\xd1\xd1", re2));
}
prints:
0
1
0
0
1
0
This raises two questions:
- Why does the behavior differ between regexp and re2 (re2 seems to be more consistent)?
- Why does "\xd1\xd1" match both "." and ".."? I could understand it matching one or the other, but not both; is it one character or two?
go version devel +b0532a9 Mon Jun 8 05:13:15 2015 +0000 linux/amd64
dvyukov commented
Here are other examples of disagreement between regexp and re2 for invalid utf-8:
re=".$" str="\xb1\x98" regexp=true re2=false
panic: regexp and re2 disagree on regexp match
re=".*(..b)." str="(.a|.b\xdb|" regexp=true re2=false
panic: regexp and re2 disagree on regexp match
re="\\Q\xb4\\Q" regexp=<nil> re2=false
panic: regexp and re2 disagree on regexp validity
re="\\QT\x82\\E\\QT\\E" str="c^|^\\QTt\\c" regexp=<nil> re2=false
panic: regexp and re2 disagree on regexp validity
re="^((?:.*)+?(?:.*)+?)$" str="\xff\xbf\x80\x80$^^.^^^^((?.^^^" regexp=true re2=false
panic: regexp and re2 disagree on regexp match
re="\\Q\x8a-" str="o\\Q" regexp=<nil> re2=false
panic: regexp and re2 disagree on regexp validity
re="." str="\xd6" regexp=true re2=false
panic: regexp and re2 disagree on regexp match
re="[^-9]+z" str="\xbfz)^(?:" regexp=true re2=false
panic: regexp and re2 disagree on regexp match
rsc commented
In Go, "." matches a single malformed UTF-8 sequence; in RE2 it does not. This is mainly due to the implementation details of each, but I wouldn't change either now.
As for the second question, "xx" matches against both "." and ".." too.