duydo/elasticsearch-analysis-vietnamese

Tokenizer "vi_tokenizer" doesn't work with character filter "html_strip"

seta-hainguyen opened this issue · 3 comments

Here are my analyzer settings:

"vn_html_analyzer": {
  "type": "custom",
  "tokenizer": "vi_tokenizer",
  "char_filter": [ "html_strip" ],
  "filter": [ "icu_folding" ]
}

When I tried:
GET localhost:9200/question/_analyze { "analyzer" : "vn_html_analyzer", "text" : "<p>đỗ đại học</p>" }

It throws an error:
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[7d1c0721c1d6][172.17.0.2:9300][indices:admin/analyze[s]]"}],"type":"string_index_out_of_bounds_exception","reason":"String index out of range: -1"},"status":500}

When I replace the tokenizer "vi_tokenizer" with "standard", the error does not occur.
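To compare the two tokenizers without touching the index settings, the `_analyze` API also accepts an inline `tokenizer`, `char_filter`, and `filter` in the request body. Below is a minimal Python sketch of that comparison. It assumes a node at `http://localhost:9200` with both plugins installed. The helper names `build_analyze_body` and `analyze` are made up for illustration and are not part of the plugin or of any Elasticsearch client.

```python
import json
import urllib.request

# Assumption: a local node, as in the report above.
ES_URL = "http://localhost:9200"


def build_analyze_body(tokenizer, text):
    """Build an _analyze request body with an inline analysis chain.

    Mirrors the chain in "vn_html_analyzer": html_strip -> tokenizer -> icu_folding.
    """
    return {
        "tokenizer": tokenizer,
        "char_filter": ["html_strip"],
        "filter": ["icu_folding"],
        "text": text,
    }


def analyze(tokenizer, text, es_url=ES_URL):
    """POST the body to /_analyze and return the parsed JSON response.

    urllib raises urllib.error.HTTPError on a 4xx/5xx status, so the
    500 from the report would surface here as an exception.
    """
    payload = json.dumps(build_analyze_body(tokenizer, text)).encode("utf-8")
    request = urllib.request.Request(
        f"{es_url}/_analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `analyze("standard", "<p>đỗ đại học</p>")` and then `analyze("vi_tokenizer", "<p>đỗ đại học</p>")` against the same node should show whether the failure follows the tokenizer alone, independent of the index.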

I'm using Elasticsearch 7.3.1 and elasticsearch-analysis-vietnamese 7.3.1, installed with this Dockerfile:

FROM elasticsearch:7.3.1

COPY elasticsearch-analysis-vietnamese-7.3.1.zip /usr/share/elasticsearch/

RUN cd /usr/share/elasticsearch && \
    bin/elasticsearch-plugin install --batch file:///usr/share/elasticsearch/elasticsearch-analysis-vietnamese-7.3.1.zip && \
    bin/elasticsearch-plugin install analysis-icu

duydo commented

@seta-hainguyen Version 7.3.1 with the old VnTokenizer has a lot of issues. I switched the plugin to another tokenizer from the CocCoc team, so I no longer maintain the plugin with VnTokenizer.

The plugin is currently compatible with ES v7.4.0 and later. You can refer to the documentation to build the plugin for the version you need.

@duydo Thank you. Do you have any notes on which Java version is needed to build the CocCoc tokenizer project and your project?

duydo commented

@seta-hainguyen The CocCoc tokenizer is written in C++, so you have to build it as a shared library on the Elasticsearch node where you intend to install the plugin.
The ES plugin is compatible with Java 8 and later.