Tokenizer "vi_tokenizer" doesn't work with character filter "html_strip"
seta-hainguyen opened this issue · 3 comments
Here are my analyzer settings:
"vn_html_analyzer": { "filter": [ "icu_folding" ], "char_filter": [ "html_strip" ], "type": "custom", "tokenizer": "vi_tokenizer" }
When I tried:
GET localhost:9200/question/_analyze { "analyzer" : "vn_html_analyzer", "text" : "<p>đỗ đại học</p>" }
It throws an error:
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[7d1c0721c1d6][172.17.0.2:9300][indices:admin/analyze[s]]"}],"type":"string_index_out_of_bounds_exception","reason":"String index out of range: -1"},"status":500}
When I replace the tokenizer "vi_tokenizer" with "standard", the error does not occur.
I'm using Elasticsearch 7.3.1 and elasticsearch-analysis-vietnamese 7.3.1, installed with this Dockerfile:
FROM elasticsearch:7.3.1
COPY elasticsearch-analysis-vietnamese-7.3.1.zip /usr/share/elasticsearch/
RUN cd /usr/share/elasticsearch && \
    bin/elasticsearch-plugin install --batch file:///usr/share/elasticsearch/elasticsearch-analysis-vietnamese-7.3.1.zip && \
    bin/elasticsearch-plugin install analysis-icu
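The comparison described above (same analyzer, once with "vi_tokenizer" and once with "standard") can be scripted so the two request bodies differ only in the tokenizer. This is a minimal sketch, not part of the original report: the helper names `analyzer_settings` and `analyze_request` are hypothetical, and the code only builds JSON bodies; it does not contact Elasticsearch.

```python
import json


def analyzer_settings(tokenizer):
    """Build index settings for the custom analyzer from the report.

    Only the tokenizer name varies; the char filter and token filter
    are the ones quoted in the issue. (Hypothetical helper.)
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "vn_html_analyzer": {
                        "type": "custom",
                        "char_filter": ["html_strip"],
                        "tokenizer": tokenizer,
                        "filter": ["icu_folding"],
                    }
                }
            }
        }
    }


def analyze_request(text):
    """Build the body for the _analyze call quoted in the issue."""
    return {"analyzer": "vn_html_analyzer", "text": text}


if __name__ == "__main__":
    # Print one settings body per tokenizer so the two runs can be diffed.
    for tokenizer in ("vi_tokenizer", "standard"):
        print(json.dumps(analyzer_settings(tokenizer), ensure_ascii=False))
    print(json.dumps(analyze_request("<p>đỗ đại học</p>"), ensure_ascii=False))
```

Each settings body would be used when creating a test index, followed by the same `_analyze` request, which isolates the tokenizer as the only variable.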
@seta-hainguyen Version 7.3.1, which uses the old VnTokenizer, has a lot of issues. I switched the plugin to another tokenizer from the CocCoc team, so I no longer maintain the VnTokenizer-based plugin.
The plugin is currently compatible with ES v7.4.0 and later; you can refer to the documentation to build the plugin for the version you need.
@duydo Thank you. Do you have any notes on which Java version is needed to build the CocCoc tokenizer project and your project?
@seta-hainguyen The CocCoc tokenizer is written in C++, so you have to build it as a shared library on the Elasticsearch node where you intend to install the plugin.
The ES plugin is compatible with Java 8 and later.