Unexpected behavior with `sudachi_ja_stop` Preceding `sudachi_normalizedform`

Question

togatoga opened this issue a year ago · 0 comments

When `sudachi_ja_stop` is placed before `sudachi_normalizedform` in the filter chain, stopwords are not removed as expected.

I added an experiment to the test code in my fork of the repository, so you can run it yourself. The test passes, but the behavior it demonstrates is not what I expected.

With `sudachi_ja_stop` placed before `sudachi_normalizedform`, as in the settings below, the stopwords have no effect. I expected the query "東京にふく" to be tokenized into "東京", "に", and "ふく", and the stop filter to then remove "に" and "ふく", leaving only "東京". Instead, the result is "東京", "に", and "吹く". The result is the same whether or not "吹く" or "に" is included in the stopwords.

{
  "index.analysis": {
    "analyzer": {
      "sudachi_test": {
        "type": "custom",
        "tokenizer": "sudachi_tokenizer",
        "filter": ["my_stopfilter", "sudachi_normalizedform"]
      }
    },
    "tokenizer": {
      "sudachi_tokenizer": {
        "type": "sudachi_tokenizer",
        "split_mode": "C"
      }
    },
    "filter": {
      "my_stopfilter": {
        "type": "sudachi_ja_stop",
        "stopwords": ["に", "ふく", "吹く"]
      }
    }
  }
}

Conversely, if the order is swapped so that `sudachi_normalizedform` comes before `sudachi_ja_stop`, and the normalized form ("吹く") is included in the stopwords, filtering works: the query "東京にふく" is reduced to "東京". However, I would not expect to have to list the normalized form in the stopwords.

{
  "index.analysis": {
    "analyzer": {
      "sudachi_test": {
        "type": "custom",
        "tokenizer": "sudachi_tokenizer",
        "filter": ["sudachi_normalizedform", "my_stopfilter"]
      }
    },
    "tokenizer": {
      "sudachi_tokenizer": {
        "type": "sudachi_tokenizer",
        "split_mode": "C"
      }
    },
    "filter": {
      "my_stopfilter": {
        "type": "sudachi_ja_stop",
        "stopwords": ["に", "吹く"]
      }
    }
  }
}
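
For anyone who wants to reproduce this outside the test code, here is a minimal sketch. It assumes an index has been created with one of the two settings above; the index name `sudachi_test_index` is hypothetical, and only the analyzer name and the query text come from this issue. Sending the following body to `GET /sudachi_test_index/_analyze` should list the tokens the analyzer emits:

```json
{
  "analyzer": "sudachi_test",
  "text": "東京にふく"
}
```

Running the same request against each filter order makes the difference in the emitted tokens easy to compare.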

My concrete problem is the phrase "確認したい": the token "し" is not removed by the stop filter as I want, because it has been normalized to "為る" and no longer matches the stopword. Adding "為る" to the stopwords works around this, but it is not a fundamental solution.
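
A sketch of that workaround, assuming the second filter order above (`sudachi_normalizedform` before `my_stopfilter`). Only the stopword list changes: "為る" is added as the normalized form of "し", alongside the entries from the earlier example:

```json
{
  "filter": {
    "my_stopfilter": {
      "type": "sudachi_ja_stop",
      "stopwords": ["に", "吹く", "為る"]
    }
  }
}
```

The drawback is that every stopword has to be listed in its normalized form, which is exactly what I would like to avoid.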

I am eager to contribute to the development. I have already started reading and trying to understand the code. If there is anything I can help with, please let me know.