icu tokenizer may panic on invalid UTF-8
When the ICU tokenizer gets invalid UTF-8 input such as:
"something\x96something"
You may get a panic. This seems to depend on the version of ICU you have installed, and may also depend on some default ICU settings and/or environment variables.
Some users have reported that adding the following fixes the issue for them.
// #include "unicode/ucnv.h"
func init() {
C.ucnv_setDefaultName(C.CString("UTF-8"))
}
This issue has been moved from the bleve repo: blevesearch/bleve#185
Possibly a related error?
go get github.com/blevesearch/blevex/icu
# github.com/blevesearch/blevex/icu
../../../blevesearch/blevex/icu/boundary.go:15:11: fatal error: 'unicode/utypes.h' file not found
#include "unicode/utypes.h"
^
1 error generated.
Is it because there is a dependency I need to install?
Running go test in github.com/blevesearch/blevex/lang/th, I see this panic on my system too:
panic: runtime error: slice bounds out of range
goroutine 21 [running]:
github.com/blevesearch/blevex/icu.(*UnicodeWordBoundaryTokenizer).Tokenize(0xc4200b4178, 0xc4205dc000, 0x31a, 0x31a, 0x0, 0x0, 0x0)
/var/www/go/src/github.com/blevesearch/blevex/icu/boundary.go:103 +0x67b
github.com/blevesearch/bleve/analysis.(*Analyzer).Analyze(0xc4200b8780, 0xc4205dc000, 0x31a, 0x31a, 0x31a, 0x31a, 0x7cdf9b7326f47234)
/var/www/go/src/github.com/blevesearch/bleve/analysis/type.go:86 +0xcc
github.com/blevesearch/bleve/document.(*TextField).Analyze(0xc42052f920, 0xf, 0xc4205628fe)
/var/www/go/src/github.com/blevesearch/bleve/document/field_text.go:72 +0x86
github.com/blevesearch/bleve/index/upsidedown.(*UpsideDownCouch).Analyze.func1(0x9f6e20, 0xc42052f920, 0x1)
/var/www/go/src/github.com/blevesearch/bleve/index/upsidedown/analysis.go:48 +0x35b
github.com/blevesearch/bleve/index/upsidedown.(*UpsideDownCouch).Analyze(0xc4201c2300, 0xc42051ca80, 0xc420562f38)
/var/www/go/src/github.com/blevesearch/bleve/index/upsidedown/analysis.go:70 +0x414
github.com/blevesearch/bleve/index.AnalysisWorker(0xc42008e120, 0xc42008e180)
/var/www/go/src/github.com/blevesearch/bleve/index/analysis.go:106 +0x55
created by github.com/blevesearch/bleve/index.NewAnalysisQueue
/var/www/go/src/github.com/blevesearch/bleve/index/analysis.go:94 +0xcd
I confirmed the issue is fixed by adding these lines into blevex/icu/boundary.go
// #include "unicode/ucnv.h"
func init() {
C.ucnv_setDefaultName(C.CString("UTF-8"))
}
@mschoch Do you have any plans to include this patch upstream? It would be really nice, thank you.
Thanks @atthakorn. For anybody else running into this who needs a temporary workaround, I wonder whether those lines of init() code can also just be invoked from application code.
I tried it: the following lines can be invoked from app code and work fine. Thanks a lot (I'm new to Go).
// #cgo LDFLAGS: -licuuc -licudata
// #include "unicode/ucnv.h"
import "C"
func init() {
C.ucnv_setDefaultName(C.CString("UTF-8"))
}
However, to leave a trail for others: blevesearch/blevex does not support vendoring. At least, I tried with dep and it failed to meet constraints:
$dep ensure -add github.com/blevesearch/blevex
Solving failure: No versions of github.com/blevesearch/blevex met constraints:
To install blevesearch/blevex as an extension, the workaround can be either:
- Copy blevex locally as an internal package: internal/blevex/icu. This option is minimal, as we can grab only the desired extensions, e.g. in my case I'm using the blevex/icu and blevex/lang/th modules.
- Make blevesearch/blevex a git submodule.
Wherever the blevex modules live (local copy or submodule), we can put this workaround patch into a separate file, e.g. blevex-icu-patch, in any package in the app layer, so we don't pollute the core extension and the code stays clean.