
elasticsearch-jieba-plugin

Update 2022-12-01

  1. New branches:
  • 7.17.x branch, supporting ES 7.17.0; JDK 11.0.7, Gradle 7.6
  • 8.4.1 branch, supporting ES 8.4.1; JDK 18.0.2.1, Gradle 7.6
  2. When adapting to a different ES version (and JDK version), consult the ES/JDK version compatibility matrix.
  3. To adapt to a different ES version, modify the following files; the places that need changes are annotated:
  • build.gradle
  • src/main/resources/plugin-descriptor.properties
  4. To switch to a different Gradle version (7.6 is the target version here), run:
gradle wrapper --gradle-version 7.6
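For orientation, the version pins in plugin-descriptor.properties typically look like the sketch below. Treat every value as a placeholder (in particular, the classname is a hypothetical example, not necessarily the plugin's actual entry point); the file in the repository annotates the exact lines to change:

```properties
# Illustrative values only — adjust to your target ES and JDK versions.
description=jieba analysis plugin for elasticsearch
version=7.17.0
name=analysis-jieba
# Hypothetical classname for illustration; use the one in the repository.
classname=org.elasticsearch.plugin.analysis.jieba.AnalysisJiebaPlugin
java.version=11
elasticsearch.version=7.17.0
```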

jieba analysis plugin for elasticsearch: 7.7.0, 7.4.2, 7.3.0, 7.0.0, 6.4.0, 6.0.0, 5.4.0, 5.3.0, 5.2.2, 5.2.1, 5.2.0, 5.1.2, 5.1.1

Features

  • Dictionaries can be added dynamically without restarting ES.

Only minor changes are needed to adapt the plugin to a different ES version.

See here

Dictionaries can be added dynamically; ES does not need to restart.

See here

On using jieba_index and jieba_search

See here

New tokenizer support

If you are on ES 6.4.0, use the latest code on the 6.4.0 branch or on master, or download the 6.4.1 release. Upgrading is strongly recommended!

The 6.4.1 release fixes the PositionIncrement issue. For details, see the write-up on PositionIncrement in ES analysis (ES 分词 PositionIncrement 解析).

Version mapping

Branch   Tag      Elasticsearch version   Release link
7.7.0    v7.7.1   v7.7.0                  Download: v7.7.0
7.4.2    v7.4.2   v7.4.2                  Download: v7.4.2
7.3.0    v7.3.0   v7.3.0                  Download: v7.3.0
7.0.0    v7.0.0   v7.0.0                  Download: v7.0.0
6.4.0    v6.4.1   v6.4.0                  Download: v6.4.1
6.4.0    v6.4.0   v6.4.0                  Download: v6.4.0
6.0.0    v6.0.0   v6.0.0                  Download: v6.0.1
5.4.0    v5.4.0   v5.4.0                  Download: v5.4.0
5.3.0    v5.3.0   v5.3.0                  Download: v5.3.0
5.2.2    v5.2.2   v5.2.2                  Download: v5.2.2
5.2.1    v5.2.1   v5.2.1                  Download: v5.2.1
5.2      v5.2.0   v5.2.0                  Download: v5.2.0
5.1.2    v5.1.2   v5.1.2                  Download: v5.1.2
5.1.1    v5.1.1   v5.1.1                  Download: v5.1.1

More details

  • Choose the source code for the right version.
  • Run:
git clone https://github.com/sing1ee/elasticsearch-jieba-plugin.git --recursive
./gradlew clean pz
  • Copy the zip file into the plugin directory:
cp build/distributions/elasticsearch-jieba-plugin-5.1.2.zip ${path.home}/plugins
  • Unzip it and remove the zip file:
unzip elasticsearch-jieba-plugin-5.1.2.zip
rm elasticsearch-jieba-plugin-5.1.2.zip
  • Start Elasticsearch:
./bin/elasticsearch

Custom User Dict

Just put your dict file, with the suffix .dict, into ${path.home}/plugins/jieba/dic. Your dict file should look like this:

小清新 3
百搭 3
显瘦 3
隨身碟 100
your_word word_freq
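Each line is a word followed by an integer frequency, separated by whitespace. A malformed file can be caught before deployment with a short check; `parse_user_dict` below is an illustrative helper for that format, not part of the plugin:

```python
def parse_user_dict(text):
    """Parse user-dict lines of the form '<word> <freq>'.

    Returns a list of (word, freq) tuples; raises ValueError on
    malformed lines so a broken file is caught before deployment.
    """
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are harmless
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            raise ValueError(f"line {lineno}: expected '<word> <freq>', got {line!r}")
        entries.append((parts[0], int(parts[1])))
    return entries

sample = """小清新 3
百搭 3
隨身碟 100"""
print(parse_user_dict(sample))
# [('小清新', 3), ('百搭', 3), ('隨身碟', 100)]
```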

Using stopwords

  • Find stopwords.txt in ${path.home}/plugins/jieba/dic.
  • Create a folder named stopwords under ${path.home}/config:
mkdir -p ${path.home}/config/stopwords
  • Copy stopwords.txt into the folder just created:
cp ${path.home}/plugins/jieba/dic/stopwords.txt ${path.home}/config/stopwords
  • Create the index:
PUT http://localhost:9200/jieba_index
{
  "settings": {
    "analysis": {
      "filter": {
        "jieba_stop": {
          "type":        "stop",
          "stopwords_path": "stopwords/stopwords.txt"
        },
        "jieba_synonym": {
          "type":        "synonym",
          "synonyms_path": "synonyms/synonyms.txt"
        }
      },
      "analyzer": {
        "my_ana": {
          "tokenizer": "jieba_index",
          "filter": [
            "lowercase",
            "jieba_stop",
            "jieba_synonym"
          ]
        }
      }
    }
  }
}
  • Test the analyzer:
POST http://localhost:9200/jieba_index/_analyze
{
  "analyzer" : "my_ana",
  "text" : "黄河之水天上来"
}

The response is as follows:

{
    "tokens": [
        {
            "token": "黄河",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "黄河之水天上来",
            "start_offset": 0,
            "end_offset": 7,
            "type": "word",
            "position": 0
        },
        {
            "token": "之水",
            "start_offset": 2,
            "end_offset": 4,
            "type": "word",
            "position": 1
        },
        {
            "token": "天上",
            "start_offset": 4,
            "end_offset": 6,
            "type": "word",
            "position": 2
        },
        {
            "token": "上来",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 2
        }
    ]
}
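Note how jieba_index emits overlapping tokens that share a position (e.g. 黄河 and 黄河之水天上来 both at position 0), which is what makes it suitable for indexing. One quick way to see this overlap structure is to group the tokens of an _analyze response by position; this is an illustrative client-side snippet, not part of the plugin:

```python
# Tokens copied from the _analyze response above (offsets omitted).
tokens = [
    {"token": "黄河", "position": 0},
    {"token": "黄河之水天上来", "position": 0},
    {"token": "之水", "position": 1},
    {"token": "天上", "position": 2},
    {"token": "上来", "position": 2},
]

def group_by_position(tokens):
    """Group token strings by their 'position' field, preserving order."""
    groups = {}
    for t in tokens:
        groups.setdefault(t["position"], []).append(t["token"])
    return groups

print(group_by_position(tokens))
# {0: ['黄河', '黄河之水天上来'], 1: ['之水'], 2: ['天上', '上来']}
```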

NOTE

Migrated from jieba-solr.

Roadmap

I will add support for more analyzers:

  • Stanford Chinese analyzer
  • FudanNLP analyzer
  • ...

If you have ideas, please create an issue and we will work on them together.