golang/go

regexp: confusing behavior on invalid utf-8 sequences

dvyukov opened this issue · 2 comments

The following program:

package main

import "regexp"

func main() {
    re := regexp.MustCompile(".")
    println(re.MatchString("\xd1"))
    println(re.MatchString("\xd1\x84"))
    println(re.MatchString("\xd1\xd1"))
    re = regexp.MustCompile("..")
    println(re.MatchString("\xd1"))
    println(re.MatchString("\xd1\x84"))
    println(re.MatchString("\xd1\xd1"))
}

prints:

true
true
true
false
false
true

While the following C++ program:

#include <stdio.h>
#include <re2/re2.h>

int main() {
    RE2 re1(".");
    printf("%d\n", RE2::PartialMatch("\xd1", re1));
    printf("%d\n", RE2::PartialMatch("\xd1\x84", re1));
    printf("%d\n", RE2::PartialMatch("\xd1\xd1", re1));
    RE2 re2(".");
    printf("%d\n", RE2::PartialMatch("\xd1", re2));
    printf("%d\n", RE2::PartialMatch("\xd1\x84", re2));
    printf("%d\n", RE2::PartialMatch("\xd1\xd1", re2));
}

prints:

0
1
0
0
1
0

This raises two questions:

  1. Why does the behavior differ between regexp and re2 (re2 seems to be more consistent)?
  2. Why does "\xd1\xd1" match both "." and ".."? I could understand it matching one or the other, but not both; is it one character or two?

go version devel +b0532a9 Mon Jun 8 05:13:15 2015 +0000 linux/amd64

Here are other examples of disagreement between regexp and re2 for invalid utf-8:

re=".$" str="\xb1\x98" regexp=true re2=false
panic: regexp and re2 disagree on regexp match

re=".*(..b)." str="(.a|.b\xdb|" regexp=true re2=false
panic: regexp and re2 disagree on regexp match

re="\\Q\xb4\\Q" regexp=<nil> re2=false
panic: regexp and re2 disagree on regexp validity

re="\\QT\x82\\E\\QT\\E" str="c^|^\\QTt\\c" regexp=<nil> re2=false
panic: regexp and re2 disagree on regexp validity

re="^((?:.*)+?(?:.*)+?)$" str="\xff\xbf\x80\x80$^^.^^^^((?.^^^" regexp=true re2=false
panic: regexp and re2 disagree on regexp match

re="\\Q\x8a-" str="o\\Q" regexp=<nil> re2=false
panic: regexp and re2 disagree on regexp validity

re="." str="\xd6" regexp=true re2=false
panic: regexp and re2 disagree on regexp match

re="[^-9]+z" str="\xbfz)^(?:" regexp=true re2=false
panic: regexp and re2 disagree on regexp match

rsc commented

In Go, "." matches a single malformed UTF-8 sequence; in RE2 it does not. This is mainly due to the implementation details of each, but I wouldn't change either now.

As for the second question, "xx" matches against both "." and ".." too.
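
The point about "xx" is that MatchString is unanchored: it reports whether the pattern matches anywhere in the input, so a two-character string satisfies both "." and "..". A rough sketch of how anchoring separates the two cases follows. The helper name matches is invented here, and the anchored patterns are an illustration, not something proposed in this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches is a hypothetical convenience wrapper around MustCompile and
// MatchString, used only to keep the table below short.
func matches(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func main() {
	patterns := []string{".", "..", "^.$", "^..$"}
	for _, s := range []string{"xx", "\xd1\xd1"} {
		for _, p := range patterns {
			// Unanchored patterns may match any substring; the
			// anchored ones must consume the whole input.
			fmt.Printf("re=%q str=%q match=%v\n", p, s, matches(p, s))
		}
	}
}
```

With the anchors in place, both "xx" and "\xd1\xd1" should match only `^..$`, which answers "is it one character or two?" for Go's regexp: two.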