Re2go fails when compiling expression containing combining character
RadhiFadlillah opened this issue · 5 comments
I'm trying to generate Go code for the regular expression güncellen?me. I made my re2go input file like this:
package main

import (
	"fmt"
)

func parse(str string) {
	var cur, mar int
	for cur < len(str) {
		/*!re2c
		re2c:tags = 1;
		re2c:yyfill:enable = 0;
		re2c:define:YYCTYPE = byte;
		re2c:define:YYPEEK = "str[cur]";
		re2c:define:YYSKIP = "cur += 1";
		re2c:define:YYBACKUP = "mar = cur";
		re2c:define:YYRESTORE = "cur = mar";

		güncellen?me {
			fmt.Println("MATCH FOUND")
			return
		}
		[\000] { return }
		* { continue }
		*/
	}
}

func main() {
	parse("<html><body><p><em>Son güncelleme: 5/5/2020</em></p></body></html>\000")
}
Then I run re2go like this:
re2go -u --flex-syntax -i main.go -o hmm.go
Unfortunately it always fails with the following message:
input.go:25:5: error: unexpected character: '�'
Any tips on how to solve this issue? Thanks!
Nevermind. Was stupid, found #237, got smarter.
For other people who stumble on the same issue, change the rule in your grammar like this:
g[ü]ncellen?me {
	fmt.Println("MATCH FOUND")
	return
}
Then run re2go with the --input-encoding utf8 flag:
re2go --input-encoding utf8 -u --flex-syntax -i input.re -o main.go
Right, for UTF-8 encoded source code, use --input-encoding utf8.
For UTF-8 encoded input, use --utf8 / re2c:encoding:utf8 = 1; (you used -u, which is not UTF-8 but UTF-32, which means that your lexer is generated for UTF-32 encoded input). At some point re2c had only short options -u, -8 and so on, but now it has less confusing aliases --utf32, --utf8, etc., as well as configurations for these options.
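To make the difference between the two encodings concrete, here is a small Go sketch (not re2go output) that shows how the same text looks as UTF-8 code units and as UTF-32 code points. The helper names utf8Units and utf32Units are made up for this illustration, and the sketch assumes the precomposed letter ü (U+00FC), written as an escape so that the example does not depend on how the file is saved.

```go
package main

import "fmt"

// utf8Units returns the UTF-8 code units (bytes) of s.
// A lexer generated with --utf8 reads the input in these units.
func utf8Units(s string) []byte {
	return []byte(s)
}

// utf32Units returns the Unicode code points of s, one per element.
// A lexer generated with -u / --utf32 expects the input in these units.
func utf32Units(s string) []rune {
	return []rune(s)
}

func main() {
	word := "g\u00fcncelleme" // "güncelleme" with a precomposed ü

	// ü occupies two bytes in UTF-8 but is a single code point.
	fmt.Printf("% x\n", utf8Units("\u00fc")) // c3 bc
	fmt.Printf("%U\n", utf32Units("\u00fc")) // [U+00FC]

	fmt.Println(len(utf8Units(word)), len(utf32Units(word))) // 11 10
}
```

With YYCTYPE = byte and a UTF-8 string as input, the lexer sees the two bytes c3 bc, which is why the UTF-8 option is the one that fits the input file above.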
Flex syntax support is somewhat rudimentary in re2c. In this case it should have worked with güncellen?me, but the more recently added --input-encoding option did not play well with --flex-syntax (in the source code, re2c consumes one byte at a time, disregarding the possibility of multibyte characters). This is actually a bug, so I'm reopening this issue to fix it.
I'm glad that g[ü]ncellen?me worked out. Alternatively you can just use re2c-native syntax "güncelle" "n"? "me", which is fully supported, and avoid any further potential issues with flex-like syntax.
@RadhiFadlillah Also note that for cur < len(str) is not a correct way of handling the end of input; you can replace it with just for. It is the sentinel rule [\000] { return } that stops the lexer. More info here: http://re2c.org/manual/manual_go.html#handling-the-end-of-input.
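As a rough illustration of the sentinel idea, here is a hand-written Go sketch. It is not code generated by re2go, and the names matchAt and find are hypothetical. The scanning loop has no length check, and the only thing that ends it without a match is the NUL byte at the end of the input, so the caller must always append that sentinel.

```go
package main

import (
	"fmt"
	"strings"
)

// matchAt reports whether the pattern "güncelle" "n"? "me" matches
// at byte offset cur of str.
func matchAt(str string, cur int) bool {
	const prefix = "g\u00fcncelle" // "güncelle" with a precomposed ü
	rest := str[cur:]
	if !strings.HasPrefix(rest, prefix) {
		return false
	}
	rest = strings.TrimPrefix(rest[len(prefix):], "n") // optional "n"
	return strings.HasPrefix(rest, "me")
}

// find scans a NUL-terminated string and returns the byte offset of the
// first match, or -1 if the sentinel is reached first.
// It panics if str does not contain the NUL sentinel and has no match.
func find(str string) int {
	cur := 0
	for { // no "cur < len(str)" condition: the sentinel ends the loop
		if str[cur] == 0 {
			return -1
		}
		if matchAt(str, cur) {
			return cur
		}
		cur++
	}
}

func main() {
	fmt.Println(find("Son g\u00fcncelleme: 5/5/2020\x00")) // 4
	fmt.Println(find("no match here\x00"))                 // -1
}
```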
Here's a fix: cbd52e0, it will be merged into master once it passes all the CI checks.
I'll close this bug, please reopen if you have any further issues.