WorksApplications/elasticsearch-sudachi

Synonym expansion not working (Elasticsearch v8 + sudachi_split)

rema424 opened this issue · 0 comments

Summary

In Elasticsearch v8, synonym expansion does not occur when the sudachi_split and synonym token filters are used together in the same analyzer.

Steps to Reproduce

  1. Set up an Elasticsearch v8 environment
  2. Configure an index to use both sudachi_split and synonym filters
  3. Index documents into the index
  4. Execute a search query containing a term that has a configured synonym
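
The steps above can be sketched as a sequence of REST requests. This is a minimal sketch, not the script actually used for this report: the node URL `http://localhost:9200` and the index name `test_sudachi` match the curl calls in this report, while the field name `body`, the mapping, and the helper names are hypothetical choices made for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"   # local node, as in the curl calls in this report
INDEX = "test_sudachi"               # index name used in the _analyze calls
FIELD = "body"                       # hypothetical field name


def build_steps(settings, doc_text, query_text,
                analyzer="sudachi_synonym_search_analyzer"):
    """Return (method, path, body) tuples for steps 2-4."""
    create_index = {
        "settings": settings,
        # Assumed mapping: one text field analyzed with the combined analyzer.
        "mappings": {
            "properties": {FIELD: {"type": "text", "analyzer": analyzer}}
        },
    }
    return [
        ("PUT", f"/{INDEX}", create_index),
        ("POST", f"/{INDEX}/_doc?refresh=true", {FIELD: doc_text}),
        ("POST", f"/{INDEX}/_search", {"query": {"match": {FIELD: query_text}}}),
    ]


def send(method, path, body):
    """Send one request to the node and return the decoded JSON response."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the value of `"settings"` from the Configuration section loaded into `settings`, a loop such as `for step in build_steps(settings, "関空", "関西国際空港"): print(send(*step))` would run steps 2 to 4. The document text and query here are illustrative only.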

Expected Behavior

The synonym filter should expand synonyms, and documents containing the synonyms should be returned as hits.

Actual Behavior

Synonym expansion does not occur, and documents containing synonyms are not returned as hits.

Related Information

  • In Elasticsearch v7, the sample configuration from the documentation produced synonym expansion as expected
  • The documentation was last updated 4 years ago (for Elasticsearch v7), so the behavior may have changed in later releases

Environment

  • OS:
    • macOS 13.4.1
    • arm64
  • Docker version: 26.0.0
  • Elasticsearch version: 8.8.1
  • elasticsearch-sudachi version: 3.1.0

$ sw_vers
ProductName:            macOS
ProductVersion:         13.4.1
BuildVersion:           22F82

$ uname -m
arm64

$ hostinfo
Mach kernel version:
         Darwin Kernel Version 22.5.0: Thu Jun  8 22:22:19 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T8103
Kernel configured for up to 8 processors.
8 processors are physically available.
8 processors are logically available.
Processor type: arm64e (ARM64E)
Processors active: 0 1 2 3 4 5 6 7
Primary memory available: 8.00 gigabytes
Default processor set: 419 tasks, 3980 threads, 8 processors
Load average: 2.02, Mach factor: 6.09

$ docker -v
Docker version 26.0.0, build 2ae903e

$ curl -X GET 'http://localhost:9200/'
{
  "name" : "5edac9bc174f",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "rtQ7kzApQ-OSQQ86bnYkPg",
  "version" : {
    "number" : "8.8.1",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "f8edfccba429b6477927a7c1ce1bc6729521305e",
    "build_date" : "2023-06-05T21:32:25.188464208Z",
    "build_snapshot" : false,
    "lucene_version" : "9.6.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}

$ elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v3.1.0/elasticsearch-8.8.1-analysis-sudachi-3.1.0.zip

Configuration

Index settings:

{
  "settings": {
    "index": {
      "number_of_replicas": "0",
      "analysis": {
        "filter": {
          "search": {
            "type": "sudachi_split",
            "mode": "search"
          },
          "synonym": {
            "type": "synonym",
            "synonyms": ["関西国際空港,関空", "関西 => 近畿"]
          }
        },
        "tokenizer": {
          "sudachi_c_tokenizer": {
            "type": "sudachi_tokenizer",
            "additional_settings": "{\"systemDict\":\"system_core.dic\"}",
            "discard_punctuation": "true",
            "split_mode": "C"
          }
        },
        "analyzer": {
          "sudachi_search_analyzer": {
            "type": "custom",
            "char_filter": [],
            "tokenizer": "sudachi_c_tokenizer",
            "filter": ["search"]
          },
          "sudachi_synonym_analyzer": {
            "type": "custom",
            "char_filter": [],
            "tokenizer": "sudachi_c_tokenizer",
            "filter": ["synonym"]
          },
          "sudachi_synonym_search_analyzer": {
            "type": "custom",
            "char_filter": [],
            "tokenizer": "sudachi_c_tokenizer",
            "filter": ["synonym", "search"]
          }
        }
      }
    }
  }
}
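
Because `sudachi_synonym_search_analyzer` chains two filters, it may help to inspect the token stream after each stage. The `_analyze` API accepts `"explain": true`, which reports the tokens emitted by the tokenizer and by each token filter in order. The sketch below only builds the request body and summarizes a response for a custom analyzer; the helper names are hypothetical, and it makes no claim about what the plugin actually returns.

```python
def build_explain_request(analyzer, text):
    """Body for GET /<index>/_analyze with per-stage output enabled."""
    return {"analyzer": analyzer, "text": text, "explain": True}


def tokens_per_stage(response):
    """Return (stage name, token strings) pairs in pipeline order."""
    detail = response["detail"]
    stages = []
    tokenizer = detail.get("tokenizer")
    if tokenizer:
        stages.append(
            (tokenizer["name"], [t["token"] for t in tokenizer["tokens"]])
        )
    for token_filter in detail.get("tokenfilters", []):
        stages.append(
            (token_filter["name"], [t["token"] for t in token_filter["tokens"]])
        )
    return stages
```

Comparing the stage for `synonym` with the stage for `search` would show whether the synonym token is never produced or is dropped by the later filter.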

Analysis Results

  • With sudachi_split only:

    $ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_search_analyzer", "text" : "関西国際空港"}'
    {
      "tokens" : [
        {
          "token" : "関西国際空港",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "word",
          "position" : 0,
          "positionLength" : 3
        },
        {
          "token" : "関西",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "国際",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "空港",
          "start_offset" : 4,
          "end_offset" : 6,
          "type" : "word",
          "position" : 2
        }
      ]
    }
  • With synonym filter only:

    $ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_synonym_analyzer", "text" : "関西国際空港"}'
    {
      "tokens" : [
        {
          "token" : "関西国際空港",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "関空",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "SYNONYM",
          "position" : 0
        }
      ]
    }
  • With both sudachi_split and synonym filter:

    $ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_synonym_search_analyzer", "text" : "関西国際空港"}'
    {
      "tokens" : [
        {
          "token" : "関西国際空港",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "word",
          "position" : 0,
          "positionLength" : 3
        },
        {
          "token" : "関西",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "国際",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "空港",
          "start_offset" : 4,
          "end_offset" : 6,
          "type" : "word",
          "position" : 2
        }
      ]
    }

    The synonym 関空 is expected among the tokens, but it does not appear: the output is identical to the sudachi_split-only result.
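
The comparison above can also be checked mechanically. The following sketch uses a hypothetical helper, `synonym_tokens`, to pick out tokens of type `SYNONYM` from an `_analyze` response; the two sample responses are abbreviated from the outputs above, keeping only the `token` and `type` fields.

```python
def synonym_tokens(response):
    """Return the token strings whose type is SYNONYM."""
    return [t["token"] for t in response["tokens"] if t["type"] == "SYNONYM"]


# Abbreviated from the outputs above: only "token" and "type" are kept.
synonym_only = {"tokens": [
    {"token": "関西国際空港", "type": "word"},
    {"token": "関空", "type": "SYNONYM"},
]}
synonym_then_split = {"tokens": [
    {"token": "関西国際空港", "type": "word"},
    {"token": "関西", "type": "word"},
    {"token": "国際", "type": "word"},
    {"token": "空港", "type": "word"},
]}

print(synonym_tokens(synonym_only))        # ['関空']
print(synonym_tokens(synonym_then_split))  # []
```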

Questions

  1. Is there a way to make synonym expansion work when using sudachi_split and synonym filters together in an Elasticsearch v8 environment?
  2. Are there any reported issues or documents describing a similar problem?
  3. Have any workarounds or alternative configuration methods been found for this issue?

Any help or guidance would be greatly appreciated. Thank you in advance.