duydo/elasticsearch-analysis-vietnamese

The plugin with new C++ tokenizer

duydo opened this issue · 7 comments

duydo commented

From version 7.12.1, the plugin uses the Cốc Cốc C++ tokenizer instead of VnTokenizer. I have closed all issues related to VnTokenizer and will no longer maintain the plugin with VnTokenizer.

The Cốc Cốc tokenizer is used in the Cốc Cốc Search and Ads systems; the main goal of its development was to reach high performance while keeping quality reasonable for search-ranking needs.

If you want to use the plugin with earlier versions of Elasticsearch, you can build the plugin yourself by following the guide in the README file.

@duydo The packaged zip of the plugin does not contain the tokenizer. What is the process for installing the new tokenizer on an Elasticsearch node?

duydo commented

@soosinha The tokenizer is written in C++, so we have to build it as a shared library on the Elasticsearch node. You can refer to the installation guide in the README file, section "Step 1: Build C++ tokenizer for Vietnamese library".
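For context on how the Java side talks to that shared library: a minimal sketch of a JNI binding is below. The class and method names here are illustrative assumptions, not the plugin's actual API; the only grounded detail is the library name `coccoc_tokenizer_jni`, which appears as `libcoccoc_tokenizer_jni.so` in the backtraces later in this thread.

```java
// Hypothetical sketch of how a JNI binding such as com.coccoc.Tokenizer is wired up.
public class TokenizerBindingSketch {

    // The binding presumably loads the shared library like this; loading fails if
    // libcoccoc_tokenizer_jni.so is not on java.library.path (e.g. not under /usr/lib).
    static boolean tryLoad() {
        try {
            // Resolves libcoccoc_tokenizer_jni.so on Linux.
            System.loadLibrary("coccoc_tokenizer_jni");
            return true;
        } catch (UnsatisfiedLinkError e) {
            return false;
        }
    }

    // A native method the C++ side would implement; the JNI symbol for the real
    // plugin's initializer shows up in the backtrace as
    // Java_com_coccoc_Tokenizer_initialize. Commented out so this sketch compiles
    // and runs without the native library.
    // private static native int initialize(String dictPath);

    public static void main(String[] args) {
        System.out.println(tryLoad()
                ? "native tokenizer library loaded"
                : "native tokenizer library not found on java.library.path");
    }
}
```

This is why the packaged zip alone is not enough: the `.so` must be built and installed on each node before the JVM can resolve the native methods.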

@duydo I tried to build and run your plugin with the Cốc Cốc C++ tokenizer and it crashes from time to time.
Message in the log:
double free or corruption (fasttop)
ES version: >= 7.12.1
Could you help check this when you have time, please?
I think there is a problem with the new tokenizer or with the binding from C++ to Java.
Thank you

===============
Updated:
I checked and found that this problem occurs only when creating an index with more than one shard. ES uses one thread per shard, and I think this C++ library is not thread safe, which causes the crash.
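If the native library is indeed not thread safe, one common mitigation on the Java side is to serialize all calls into the native code behind a single lock. The sketch below is a simulation under that assumption: `FakeNativeTokenizer` is a stand-in for the real JNI class, with an allocation counter playing the role of the native buffers whose double free crashes the JVM.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hedged sketch: serialize access to a non-thread-safe native resource so that
// one Elasticsearch thread per shard cannot corrupt its shared state.
public class SynchronizedTokenizerSketch {

    // Simulated non-thread-safe native resource: the buffer count must return to
    // zero after every call; a negative count would be the analogue of a double free.
    static class FakeNativeTokenizer {
        private int buffers = 0;
        void tokenize() {
            buffers++;                       // "allocate"
            buffers--;                       // "free"
        }
        int leakedBuffers() { return buffers; }
    }

    private final FakeNativeTokenizer inner = new FakeNativeTokenizer();
    private final Object lock = new Object();

    // All native calls funnel through one lock, so the per-call state stays
    // consistent even when many shard threads tokenize concurrently.
    public void tokenize() {
        synchronized (lock) {
            inner.tokenize();
        }
    }

    public int leakedBuffers() { return inner.leakedBuffers(); }

    public static void main(String[] args) throws Exception {
        SynchronizedTokenizerSketch t = new SynchronizedTokenizerSketch();
        ExecutorService pool = Executors.newFixedThreadPool(8); // like 8 shards
        for (int i = 0; i < 10_000; i++) pool.submit(t::tokenize);
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        System.out.println("leaked buffers: " + t.leakedBuffers()); // prints 0
    }
}
```

A single lock trades indexing throughput for safety; an alternative under the same assumption is one tokenizer instance per thread (e.g. via `ThreadLocal`). Whether the fix in the plugin's branch takes either approach is not shown in this thread.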

duydo commented

@kennynguyeenx

> Updated: I checked and found that this problem occurs only when creating an index with more than one shard. ES uses one thread per shard, and I think this C++ library is not thread safe, which causes the crash.

This issue has been fixed in this branch https://github.com/duydo/elasticsearch-analysis-vietnamese/tree/feature/search-issues

@duydo Thank you very much

Hi, @duydo
We're using the elasticsearch-analysis-vietnamese plugin and we keep getting this error:
*** Error in `/usr/share/elasticsearch/jdk/bin/java': double free or corruption (!prev): 0x00007f49dc04bf70 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81329)[0x7f4a8754a329]
/usr/lib/libcoccoc_tokenizer_jni.so(_ZN3spp11sparsetableISt4pairIKifENS_14libc_allocatorIS3_EEE12_free_groupsEv+0x2d)[0x7f49c92e7fed]

ES version is 7.16.2

Thanks!

===========================================
Update: I had another crash; here's what journalctl -u elasticsearch.service shows:

*** Error in `/usr/share/elasticsearch/jdk/bin/java': double free or corruption (out): 0x00007fe0e407b340 ***

A fatal error has been detected by the Java Runtime Environment:

SIGBUS (0x7) at pc=0x00007fe11e1064bc, pid=5398, tid=5772

JRE version: OpenJDK Runtime Environment Temurin-17.0.1+12 (17.0.1+12) (build 17.0.1+12)
Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (17.0.1+12, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
Problematic frame:
C  [libc.so.6+0x804bc]
[thread 5592 also had an error]
[thread 5775 also had an error]
======= Backtrace: =========
/lib64/libc.so.6(+0x81329)[0x7fe11e107329]
/usr/lib/libcoccoc_tokenizer_jni.so(_ZN3spp11sparsetableISt4pairIKifENS_14libc_allocatorIS3_EEE12_free_groupsEv+0x2d)[0x7fe05c8abfed]
/usr/lib/libcoccoc_tokenizer_jni.so(_ZN9Tokenizer24unserialize_nontone_dataERKSs+0x11d)[0x7fe05c8b149d]
/usr/lib/libcoccoc_tokenizer_jni.so(Java_com_coccoc_Tokenizer_initialize+0x1ee)[0x7fe05c8a746e]
[0x7fe10129053a]
======= Memory map: ========
580000000-7ff700000 rw-p 00000000 00:00 0
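One detail worth noting in the backtrace above: the crash happens inside `Java_com_coccoc_Tokenizer_initialize` (via `Tokenizer::unserialize_nontone_data`), which is consistent with native initialization running concurrently or more than once. A standard Java-side guard for that, sketched below under that assumption, is the initialization-on-demand holder idiom; `nativeInitialize` and the dictionary path are hypothetical stand-ins, not the plugin's real API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch: ensure a native initializer runs exactly once even when many
// threads race to use the tokenizer for the first time.
public class OnceOnlyInitSketch {

    static final AtomicInteger initCalls = new AtomicInteger();

    // Stand-in for the real JNI entry point (Java_com_coccoc_Tokenizer_initialize)
    // that must not run twice; here it just counts invocations.
    static int nativeInitialize(String dictPath) {
        return initCalls.incrementAndGet();
    }

    // Holder idiom: the JVM guarantees a class's static initializer runs at most
    // once, so nativeInitialize is called exactly once no matter how many
    // threads hit ensureInitialized() concurrently.
    private static class Holder {
        static final int STATUS = nativeInitialize("/path/to/dicts"); // hypothetical path
    }

    public static int ensureInitialized() {
        return Holder.STATUS;
    }

    public static void main(String[] args) throws Exception {
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++)
            threads[i] = new Thread(OnceOnlyInitSketch::ensureInitialized);
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        System.out.println("init calls: " + initCalls.get()); // prints 1
    }
}
```

Whether the plugin's fix branch uses this idiom is not shown in the thread; the sketch only illustrates the class of race the backtrace points at.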