skvadrik/re2c

Re2go fails when compiling an expression containing a combining character

I'm trying to generate Go code for the regular expression güncellen?me. I wrote my re2go input file like this:

package main

import (
	"fmt"
)

func parse(str string) {
	var cur, mar int

	for cur < len(str) {
		/*!re2c

		re2c:tags               = 1;
		re2c:yyfill:enable      = 0;
		re2c:define:YYCTYPE     = byte;
		re2c:define:YYPEEK      = "str[cur]";
		re2c:define:YYSKIP      = "cur += 1";
		re2c:define:YYBACKUP    = "mar = cur";
		re2c:define:YYRESTORE   = "cur = mar";

		güncellen?me {
			fmt.Println("MATCH FOUND")
			return
		}
		[\000] { return }
		*      { continue }
		*/
	}
}

func main() {
	parse("<html><body><p><em>Son güncelleme: 5/5/2020</em></p></body></html>\000")
}

Then I run re2go like this:

re2go -u --flex-syntax -i main.go -o hmm.go

Unfortunately, it always fails with the following message:

input.go:25:5: error: unexpected character: '�'

Any tips on how to solve this issue? Thanks!

Nevermind. Was stupid, found #237, got smarter.

For other people who stumble on the same issue, change your grammar rule like this:

g[ü]ncellen?me {
	fmt.Println("MATCH FOUND")
	return
}

Then run re2go with the --input-encoding utf8 flag:

re2go --input-encoding utf8 -u --flex-syntax -i input.re -o main.go

Right, for UTF-8 encoded source code, use --input-encoding utf8.

For UTF-8 encoded input, use --utf8 / re2c:encoding:utf8 = 1; (you used -u, which selects UTF-32 rather than UTF-8, so your lexer was generated for UTF-32 encoded input). At some point re2c had only short options such as -u and -8, but now it has less confusing aliases --utf32, --utf8, etc., as well as configurations for these options.
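To make the difference between the two encodings concrete, here is a small Go sketch (not part of re2c; the helper name encodedSizes is made up for illustration). It counts how many code units the same text occupies in UTF-8 (one unit per byte) and in UTF-32 (one unit per code point), for both the precomposed "ü" and the "u" plus combining diaeresis form. Which of the two forms the original input file contains is an assumption I cannot verify from this thread.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedSizes is a hypothetical helper: it reports how many code units
// s occupies in UTF-8 (bytes) and in UTF-32 (code points).
func encodedSizes(s string) (utf8Units, utf32Units int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// "\u00fc" is precomposed ü; "u\u0308" is u followed by a combining diaeresis.
	for _, s := range []string{"\u00fc", "u\u0308"} {
		b, r := encodedSizes(s)
		fmt.Printf("%+q: %d UTF-8 bytes, %d UTF-32 code units\n", s, b, r)
	}
}
```

A lexer that reads a Go string one byte at a time (YYCTYPE = byte) sees the UTF-8 byte counts, which is why it should be paired with --utf8 rather than -u.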

Flex syntax support is somewhat rudimentary in re2c. In this case it should have worked with güncellen?me, but the more recently added --input-encoding option did not play well with --flex-syntax (here in the source code re2c consumes one byte at a time, disregarding the possibility of multibyte characters). This is actually a bug, so I'm reopening this issue to fix it.
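As a rough illustration of why reading one byte at a time breaks multibyte characters, the following Go sketch splits the same text bytewise and code point by code point. It is not re2c's actual code; the function names are invented for this example. A lone byte taken from the middle of a UTF-8 sequence is not valid UTF-8 on its own, which would be consistent with the '�' shown in the error message above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// splitBytewise returns every byte of s as its own one-byte string,
// mimicking a reader that ignores multibyte characters.
func splitBytewise(s string) []string {
	parts := make([]string, 0, len(s))
	for i := 0; i < len(s); i++ {
		parts = append(parts, s[i:i+1])
	}
	return parts
}

// splitRunewise returns every code point of s as its own string.
func splitRunewise(s string) []string {
	var parts []string
	for _, r := range s {
		parts = append(parts, string(r))
	}
	return parts
}

// countInvalid counts the parts that are not valid UTF-8 by themselves.
func countInvalid(parts []string) int {
	n := 0
	for _, p := range parts {
		if !utf8.ValidString(p) {
			n++
		}
	}
	return n
}

func main() {
	src := "g\u00fcn" // "gün" with a precomposed ü (an assumption for this demo)
	fmt.Printf("bytewise: %q, invalid parts: %d\n", splitBytewise(src), countInvalid(splitBytewise(src)))
	fmt.Printf("runewise: %q, invalid parts: %d\n", splitRunewise(src), countInvalid(splitRunewise(src)))
}
```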

I'm glad that g[ü]ncellen?me worked out. Alternatively you can just use re2c-native syntax "güncelle" "n"? "me" which is fully supported, and avoid any further potential issues with flex-like syntax.

@RadhiFadlillah Also note that for cur < len(str) is not a correct way of handling the end of input; you can replace it with just for. It is the sentinel rule [\000] { return } that stops the lexer. More info here: http://re2c.org/manual/manual_go.html#handling-the-end-of-input.
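The sentinel idea can be sketched in plain Go without any generated code. The function below is a hand-written stand-in (scanToSentinel is a made-up name, not something re2go emits): the loop has no length check and stops only when it reads the NUL byte, so it assumes the input always ends with that sentinel, as the string passed to parse in the original example does.

```go
package main

import "fmt"

// scanToSentinel walks str one byte at a time and returns the number of
// bytes before the first NUL byte. The bare "for" relies on the sentinel
// being present; without it, str[cur] would eventually index out of range.
func scanToSentinel(str string) int {
	cur := 0
	for {
		if str[cur] == 0 { // plays the role of the rule: [\000] { return }
			return cur
		}
		cur++ // plays the role of the default rule: * { continue }
	}
}

func main() {
	fmt.Println(scanToSentinel("<p>hello</p>\x00"))
}
```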

Here's a fix: cbd52e0, it will be merged into master once it passes all the CI checks.

I'll close this bug, please reopen if you have any further issues.