blevesearch/blevex

icu tokenizer may panic on invalid UTF-8

Opened this issue · 4 comments

When the icu tokenizer gets invalid UTF-8 input like:

"something\x96something"

You may get a panic. This seems to depend on the version of ICU you have installed, and may also depend on some default ICU settings and/or environment variables.

Some users have reported that adding the following fixes the issue for them.

// #include "unicode/ucnv.h"
import "C"

func init() {
    // Make UTF-8 the default ICU converter.
    C.ucnv_setDefaultName(C.CString("UTF-8"))
}

This issue has been moved from the bleve repo: blevesearch/bleve#185

Possibly a related error?

go get github.com/blevesearch/blevex/icu
# github.com/blevesearch/blevex/icu
../../../blevesearch/blevex/icu/boundary.go:15:11: fatal error: 'unicode/utypes.h' file not found
 #include "unicode/utypes.h"
          ^
1 error generated.

Is this because there is a dependency I need to install?

Running go test in github.com/blevesearch/blevex/lang/th, I see this panic on my system too:

panic: runtime error: slice bounds out of range

goroutine 21 [running]:
github.com/blevesearch/blevex/icu.(*UnicodeWordBoundaryTokenizer).Tokenize(0xc4200b4178, 0xc4205dc000, 0x31a, 0x31a, 0x0, 0x0, 0x0)
        /var/www/go/src/github.com/blevesearch/blevex/icu/boundary.go:103 +0x67b
github.com/blevesearch/bleve/analysis.(*Analyzer).Analyze(0xc4200b8780, 0xc4205dc000, 0x31a, 0x31a, 0x31a, 0x31a, 0x7cdf9b7326f47234)
        /var/www/go/src/github.com/blevesearch/bleve/analysis/type.go:86 +0xcc
github.com/blevesearch/bleve/document.(*TextField).Analyze(0xc42052f920, 0xf, 0xc4205628fe)
        /var/www/go/src/github.com/blevesearch/bleve/document/field_text.go:72 +0x86
github.com/blevesearch/bleve/index/upsidedown.(*UpsideDownCouch).Analyze.func1(0x9f6e20, 0xc42052f920, 0x1)
        /var/www/go/src/github.com/blevesearch/bleve/index/upsidedown/analysis.go:48 +0x35b
github.com/blevesearch/bleve/index/upsidedown.(*UpsideDownCouch).Analyze(0xc4201c2300, 0xc42051ca80, 0xc420562f38)
        /var/www/go/src/github.com/blevesearch/bleve/index/upsidedown/analysis.go:70 +0x414
github.com/blevesearch/bleve/index.AnalysisWorker(0xc42008e120, 0xc42008e180)
        /var/www/go/src/github.com/blevesearch/bleve/index/analysis.go:106 +0x55
created by github.com/blevesearch/bleve/index.NewAnalysisQueue
        /var/www/go/src/github.com/blevesearch/bleve/index/analysis.go:94 +0xcd

I confirmed the issue is fixed by adding these lines to blevex/icu/boundary.go:

// #include "unicode/ucnv.h"
func init() {
    C.ucnv_setDefaultName(C.CString("UTF-8"))
}

@mschoch Do you have any plans to merge this patch upstream? That would be really nice, thank you.

Thanks @atthakorn. For anybody else running into this who needs a temporary workaround, I wonder whether those init() lines can also simply be invoked from application code.

@steveyen

Wow, I tried it: the following lines can be invoked from app code, and it works fine. Thanks a lot (I'm new to Go).

// #cgo LDFLAGS: -licuuc -licudata
// #include "unicode/ucnv.h"
import "C"

func init() {
	C.ucnv_setDefaultName(C.CString("UTF-8"))
}

However, to leave a trail for others: blevesearch/blevex does not seem to support vendoring. At least, when I tried it with dep, it failed to meet constraints:

$ dep ensure -add github.com/blevesearch/blevex

Solving failure: No versions of github.com/blevesearch/blevex met constraints:

To install blevesearch/blevex as an extension, the workaround can be either of the following:

  1. Copy blevex locally as an internal package, e.g. internal/blevex/icu. This option is minimal because you can grab only the extensions you need; in my case, I'm using the blevex/icu and blevex/lang/th packages.

  2. Add blevesearch/blevex as a git submodule.

Wherever the blevex packages live (local copy or submodule), we can put this workaround patch into a separate file, e.g. blevex-icu-patch, in any package in the app layer.

That way we don't pollute the core extension, and the code stays clean.