Synonym expansion not working (Elasticsearch v8 + sudachi_split)
rema424 opened this issue · 0 comments
Summary
In an Elasticsearch v8 environment, synonym expansion does not work when the sudachi_split and synonym filters are used together.
Steps to Reproduce
- Set up an Elasticsearch v8 environment
- Configure an index to use both the sudachi_split and synonym filters
- Index documents into the index
- Execute a search query containing synonyms
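The steps above could be scripted roughly as follows. This is only a sketch, not the exact procedure used for this report: the field name `body`, the choice of `sudachi_synonym_search_analyzer` for that field, the document text, the query text, and the `settings.json` file name are all assumptions made for illustration. The index name and host are taken from the curl calls later in this report.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # host used by the curl calls in this report
INDEX = "test_sudachi"              # index name used by the _analyze calls


def build_steps(index_settings: dict) -> list:
    """Return (method, path, body) triples for the reproduction steps."""
    create_body = dict(index_settings)
    # Assumption: one text field named "body" that uses the combined analyzer.
    create_body["mappings"] = {
        "properties": {
            "body": {
                "type": "text",
                "analyzer": "sudachi_synonym_search_analyzer",
            }
        }
    }
    return [
        # 1. Create the index with the analysis settings.
        ("PUT", f"/{INDEX}", create_body),
        # 2. Index a document that contains only the synonym (hypothetical text).
        ("PUT", f"/{INDEX}/_doc/1?refresh=true", {"body": "関空"}),
        # 3. Search with the other side of the synonym rule.
        ("POST", f"/{INDEX}/_search", {"query": {"match": {"body": "関西国際空港"}}}),
    ]


def send(method: str, path: str, body: dict) -> dict:
    """Send one JSON request to Elasticsearch and return the parsed response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run(settings_path: str = "settings.json") -> None:
    """Run all steps; settings_path is a hypothetical file holding the index settings."""
    with open(settings_path, encoding="utf-8") as f:
        settings = json.load(f)
    for step in build_steps(settings):
        print(json.dumps(send(*step), ensure_ascii=False, indent=2))
```

Calling `run()` against a local node would create the index, add one document, and print the search response. If synonym expansion worked, the final search would be expected to return document 1.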
Expected Behavior
The synonym filter should expand synonyms, and documents containing the synonyms should be returned as hits.
Actual Behavior
Synonym expansion does not occur, and documents containing synonyms are not returned as hits.
Related Information
- In Elasticsearch v7, the sample configuration provided in the documentation worked for synonym expansion
- The documentation was last updated 4 years ago (Elasticsearch v7), and the behavior may have changed in subsequent updates
Environment
- OS:
  - macOS 13.4.1
  - arm64
- Docker version: 26.0.0
- Elasticsearch version: 8.8.1
- elasticsearch-sudachi version: 3.1.0
$ sw_vers
ProductName: macOS
ProductVersion: 13.4.1
BuildVersion: 22F82
$ uname -m
arm64
$ hostinfo
Mach kernel version:
Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:19 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T8103
Kernel configured for up to 8 processors.
8 processors are physically available.
8 processors are logically available.
Processor type: arm64e (ARM64E)
Processors active: 0 1 2 3 4 5 6 7
Primary memory available: 8.00 gigabytes
Default processor set: 419 tasks, 3980 threads, 8 processors
Load average: 2.02, Mach factor: 6.09
$ docker -v
Docker version 26.0.0, build 2ae903e
$ curl -X GET 'http://localhost:9200/'
{
  "name" : "5edac9bc174f",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "rtQ7kzApQ-OSQQ86bnYkPg",
  "version" : {
    "number" : "8.8.1",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "f8edfccba429b6477927a7c1ce1bc6729521305e",
    "build_date" : "2023-06-05T21:32:25.188464208Z",
    "build_snapshot" : false,
    "lucene_version" : "9.6.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}
$ elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v3.1.0/elasticsearch-8.8.1-analysis-sudachi-3.1.0.zip
Configuration
Index settings:
{
  "settings": {
    "index": {
      "number_of_replicas": "0",
      "analysis": {
        "filter": {
          "search": {
            "type": "sudachi_split",
            "mode": "search"
          },
          "synonym": {
            "type": "synonym",
            "synonyms": ["関西国際空港,関空", "関西 => 近畿"]
          }
        },
        "tokenizer": {
          "sudachi_c_tokenizer": {
            "type": "sudachi_tokenizer",
            "additional_settings": "{\"systemDict\":\"system_core.dic\"}",
            "discard_punctuation": "true",
            "split_mode": "C"
          }
        },
        "analyzer": {
          "sudachi_search_analyzer": {
            "type": "custom",
            "char_filter": [],
            "tokenizer": "sudachi_c_tokenizer",
            "filter": ["search"]
          },
          "sudachi_synonym_analyzer": {
            "type": "custom",
            "char_filter": [],
            "tokenizer": "sudachi_c_tokenizer",
            "filter": ["synonym"]
          },
          "sudachi_synonym_search_analyzer": {
            "type": "custom",
            "char_filter": [],
            "tokenizer": "sudachi_c_tokenizer",
            "filter": ["synonym", "search"]
          }
        }
      }
    }
  }
}
Analysis Results
- With sudachi_split only:
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_search_analyzer", "text" : "関西国際空港"}'
{
  "tokens" : [
    {
      "token" : "関西国際空港",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "関西",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "国際",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "空港",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    }
  ]
}
- With synonym filter only:
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_synonym_analyzer", "text" : "関西国際空港"}'
{
  "tokens" : [
    {
      "token" : "関西国際空港",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "関空",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}
- With both sudachi_split and synonym filter:
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_synonym_search_analyzer", "text" : "関西国際空港"}'
{
  "tokens" : [
    {
      "token" : "関西国際空港",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "関西",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "国際",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "空港",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    }
  ]
}
The synonym expansion (関空) is expected but not occurring.
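To make the difference between the outputs easier to check mechanically, a small helper along these lines could be used. It is a sketch only: the function name `synonym_tokens` is made up for illustration, and the token lists are copied from the _analyze outputs above with the offsets omitted for brevity.

```python
def synonym_tokens(analyze_response: dict) -> list:
    """Return the terms emitted by the synonym filter (tokens with type SYNONYM)."""
    return [
        tok["token"]
        for tok in analyze_response.get("tokens", [])
        if tok.get("type") == "SYNONYM"
    ]


# Output of sudachi_synonym_analyzer (synonym filter only).
synonym_only = {
    "tokens": [
        {"token": "関西国際空港", "type": "word", "position": 0},
        {"token": "関空", "type": "SYNONYM", "position": 0},
    ]
}

# Output of sudachi_synonym_search_analyzer (synonym + sudachi_split).
split_and_synonym = {
    "tokens": [
        {"token": "関西国際空港", "type": "word", "position": 0, "positionLength": 3},
        {"token": "関西", "type": "word", "position": 0},
        {"token": "国際", "type": "word", "position": 1},
        {"token": "空港", "type": "word", "position": 2},
    ]
}

print(synonym_tokens(synonym_only))       # ['関空']
print(synonym_tokens(split_and_synonym))  # []
```

The empty list for the combined analyzer corresponds to the missing 関空 token reported here.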
Questions
- Is there a way to make synonym expansion work when using the sudachi_split and synonym filters together in an Elasticsearch v8 environment?
- Are there any reported issues or documents describing a similar problem?
- Have any workarounds or alternative configuration methods been found for this issue?
Any help or guidance would be greatly appreciated. Thank you in advance.