GRbit/go-pcre

Unicode in character class sometimes not working

ukolovda opened this issue · 4 comments

I've hit a very strange bug with Unicode symbols in PCRE character classes.

I wrote a small test:

func TestUnicodeAndClass(t *testing.T) {
	// Simple unicode works
	re := MustCompile(`ййй`, 0)
	m := re.NewMatcherString(`ййй`, 0)
	if !m.Matches {
		t.Error("Failed to find any matches")
	}

	// But the same text with a character class does not match
	re = MustCompile(`й[й]й`, 0)
	m = re.NewMatcherString(`ййй`, 0)
	if !m.Matches {
		t.Error("Failed to find any matches")
	}
}

(see https://github.com/ukolovda/go-pcre/tree/unicode-class-bug )

When I remove the first or the last symbol from the pattern, it works.

If I set the UTF8 flag, it works:

func TestUnicodeAndClass(t *testing.T) {
	re := MustCompile(`ййй`, 0)
	m := re.NewMatcherString(`ййй`, 0)
	if !m.Matches {
		t.Error("Failed to find any matches")
	}

	// 0x800 is the value of PCRE's PCRE_UTF8 compile option
	const PCRE_CONFIG_UTF8 int = 0x800
	re = MustCompile(`й[й]й`, PCRE_CONFIG_UTF8)
	m = re.NewMatcherString(`ййй`, 0)
	if !m.Matches {
		t.Error("Failed to find any matches")
	}
}

I'll try to make a PR adding this constant to the library.

The alternative is to change the pattern and use the CompileParse function:

	re = MustCompileParse(`(?u)й[й]й`)

The UTF8 flag already exists, sorry:

	re = MustCompile(`й[й]й`, UTF8)

works too.

Closing