blevesearch/bleve

panic in analysis/tokenizers/icu

Closed this issue · 8 comments

I am using bleve to index some Danish text using the github.com/blevesearch/bleve/analysis/language/da package. I get a

panic: runtime error: slice bounds out of range

which occurs at

github.com/blevesearch/bleve/analysis/tokenizers/icu.(*UnicodeWordBoundaryTokenizer).Tokenize(0xc208046860, 0xc20ef4f440, 0xbd, 0xbd, 0x0, 0x0, 0x0)
    /.../src/github.com/blevesearch/bleve/analysis/tokenizers/icu/boundary.go:104 +0x621

running on Mac OS X. I have icu4c 54.1 installed via Homebrew.

I noticed that U_BUFFER_OVERFLOW_ERROR is being ignored (lines 85 and 93). When I modify:

if err > C.U_ZERO_ERROR && err != C.U_BUFFER_OVERFLOW_ERROR {

to be

if err > C.U_ZERO_ERROR {

then the panic disappears. I haven't isolated what specific text causes the panic yet, so I can't yet provide a reproducible example.

I isolated the problem to text that was saved in another encoding, not UTF-8. As a result it contained a byte sequence that is illegal in UTF-8.

Is there some way to avoid panicking in this scenario?

Thanks for tracking it down. I plan on reviewing the code tomorrow; I took a quick look but couldn't quite remember why it works the way it does.

Can you supply the problematic text? Obviously we should try not to panic in this case, but the fix might end up being an optional UTF-8 validation step. Right now we expect/trust input to always be valid UTF-8. Obviously that's not always the case, nor something that can be guaranteed.
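A minimal sketch of what such an optional validation step could look like, assuming Go 1.13+ (for `strings.ToValidUTF8`); `sanitizeUTF8` is a hypothetical helper name, not anything in bleve:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeUTF8 is a hypothetical pre-tokenization step: it replaces any
// invalid byte sequences with U+FFFD (the Unicode replacement character)
// so downstream code only ever sees well-formed UTF-8.
func sanitizeUTF8(s string) string {
	if utf8.ValidString(s) {
		return s // fast path: already valid, no allocation
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	// %+q escapes non-ASCII, so the replacement shows up as \ufffd.
	fmt.Printf("%+q\n", sanitizeUTF8("something\x96something"))
}
```

Whether the replacement should be U+FFFD or the empty string (dropping bad bytes entirely) is a policy choice; either way the tokenizer's byte-offset arithmetic stays consistent because its input is valid UTF-8.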

CC'ing @steveyen because I think he will run into this soon too.

The problematic text was like this:

something\x96something

(That is supposed to represent the byte 0x96 embedded there.)

This is an en-dash in Windows-1252 encoding.
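For completeness, the proper fix on the producing side is to transcode such text to UTF-8 before indexing; `golang.org/x/text/encoding/charmap` is the usual tool for that. Here is a deliberately tiny, stdlib-only sketch (the function name and the single-byte mapping are illustrative, covering only the byte from this issue):

```go
package main

import "fmt"

// cp1252ToUTF8 is a hypothetical, intentionally incomplete converter:
// it passes ASCII through and maps only the one Windows-1252 byte seen
// in this issue (0x96, EN DASH). Real code should use the full table in
// golang.org/x/text/encoding/charmap instead.
func cp1252ToUTF8(b []byte) string {
	var out []rune
	for _, c := range b {
		switch {
		case c < 0x80:
			out = append(out, rune(c)) // ASCII is identical in both encodings
		case c == 0x96:
			out = append(out, '\u2013') // Windows-1252 0x96 = EN DASH
		default:
			out = append(out, '\uFFFD') // unmapped byte: replacement character
		}
	}
	return string(out)
}

func main() {
	fmt.Printf("%+q\n", cp1252ToUTF8([]byte("something\x96something")))
}
```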

Note: Don't follow my example and remove U_BUFFER_OVERFLOW_ERROR in the conditional. It completely breaks indexing. My bad for not further testing before posting.

Interesting, I cannot reproduce a panic under similar conditions, though I have ICU 52.1.

For input:

"something\x96something"

I get the token stream:

analysis.TokenStream{
	{
		Start:    0,
		End:      9,
		Term:     []byte("something"),
		Position: 1,
		Type:     analysis.AlphaNumeric,
	},
	{
		Start:    12,
		End:      21,
		Term:     []byte("mething\x00\x00"),
		Position: 2,
		Type:     analysis.AlphaNumeric,
	},
}

This is obviously wrong, but it didn't panic. It could be that ICU 54.1 behaves differently, or it could be that in my test I'm somehow just getting lucky. I have to review the ICU API to see what we can do differently.

I've also opened up a related issue, because Bleve's handling of invalid utf-8 goes beyond just this one issue: #186

Hi,

I had the same exception and fixed it by adding

// #include "unicode/ucnv.h"
func init() {
    C.ucnv_setDefaultName(C.CString("UTF-8"))
}

to the icu package. On my Windows machine it always treated the input as ANSI, not as UTF-8... Hope that helps someone :)

Thanks, yes that is related. The current implementation relies on the default converter being UTF-8, which leads to another class of bugs: the input is valid UTF-8, but your default isn't set to UTF-8.

We can either set the default to UTF-8 (as you suggest) or use another function that explicitly treats the input as UTF-8.

Now that the ICU tokenizer is part of blevex, moving this issue there: blevesearch/blevex#34