WorksApplications/elasticsearch-sudachi

Duplicate tokens for OOV when using `sudachi_split` filter's `extended` mode

sorami opened this issue · 6 comments

I get strange output when using the sudachi_split filter in extended mode. The results are fine in search mode.

For example, the input text bミチゴ becomes ["b", "b", "", "", "ミチゴ", "", "", ""].

Also, strangely, the analysis result changes from the second request onward.

Example

Elasticsearch index setting

sudachi_split filter with extended mode.

Full JSON
{
    "settings": {
        "analysis": {
            "analyzer": {
                "custom_sudachi_analyzer": {
                    "type": "custom",
                    "tokenizer": "custom_sudachi_tokenizer",
                    "char_filter": [],
                    "filter": ["custom_sudachi_split"]
                }
            },
            "tokenizer": {
                "custom_sudachi_tokenizer": {
                    "type": "sudachi_tokenizer",
                    "resources_path": "sudachi/"
                }
            },
            "filter": {
                "custom_sudachi_split": {
                    "type": "sudachi_split",
                    "mode": "extended"
                }
            }
        }
    }
}

Analysis with the tokenizer

Case A. aミチゴ 👍

=> a / ミチゴ / / /

Full query and response

Query

GET http://localhost:9200/sudachi-split-test/_analyze
{
    "analyzer": "custom_sudachi_analyzer",
    "text": "aミチゴ"
}

Response

{
    "tokens": [
        {
            "token": "a",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "ミチゴ",
            "start_offset": 1,
            "end_offset": 4,
            "type": "word",
            "position": 1,
            "positionLength": 3
        },
        {
            "token": "",
            "start_offset": 1,
            "end_offset": 2,
            "type": "word",
            "position": 1
        },
        {
            "token": "",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 2
        },
        {
            "token": "",
            "start_offset": 3,
            "end_offset": 4,
            "type": "word",
            "position": 3
        }
    ]
}

Case B. bミチゴ 👎

Analysis for the 1st time 😕

=> b / b / ミチゴ / / /

Full query and response

Query

GET http://localhost:9200/sudachi-split-test/_analyze
{
    "analyzer": "custom_sudachi_analyzer",
    "text": "bミチゴ"
}

Response

{
    "tokens": [
        {
            "token": "b",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "b",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "ミチゴ",
            "start_offset": 1,
            "end_offset": 4,
            "type": "word",
            "position": 1,
            "positionLength": 3
        },
        {
            "token": "",
            "start_offset": 1,
            "end_offset": 2,
            "type": "word",
            "position": 1
        },
        {
            "token": "",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 2
        },
        {
            "token": "",
            "start_offset": 3,
            "end_offset": 4,
            "type": "word",
            "position": 3
        }
    ]
}

Analysis for the 2nd time onwards 😕 😕 😕

=> b / b / / / ミチゴ / / /

Full query and response

Query

GET http://localhost:9200/sudachi-split-test/_analyze
{
    "analyzer": "custom_sudachi_analyzer",
    "text": "bミチゴ"
}

Response

{
    "tokens": [
        {
            "token": "b",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "b",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "",
            "start_offset": 1,
            "end_offset": 2,
            "type": "word",
            "position": 1
        },
        {
            "token": "",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 2
        },
        {
            "token": "ミチゴ",
            "start_offset": 1,
            "end_offset": 4,
            "type": "word",
            "position": 3,
            "positionLength": 3
        },
        {
            "token": "",
            "start_offset": 1,
            "end_offset": 2,
            "type": "word",
            "position": 3
        },
        {
            "token": "",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 4
        },
        {
            "token": "",
            "start_offset": 3,
            "end_offset": 4,
            "type": "word",
            "position": 5
        }
    ]
}

Reference: Sudachi analysis result (w/o Elasticsearch)

a is not OOV, whereas b is.

Case A. aミチゴ

$ echo "aミチゴ" | java -jar target/sudachi-0.4.3.jar -a
a       名詞,普通名詞,助数詞可能,*,*,*  a       a       アール  0
ミチゴ  名詞,普通名詞,一般,*,*,*        ミチゴ  ミチゴ          -1      (OOV)
EOS

Debug output
$ echo "aミチゴ" | java -jar target/sudachi-0.4.3.jar -d

=== Input dump:
aミチゴ
=== Lattice dump:
0: 10 10 (null)(0) BOS/EOS 0 0 0: 50 -739 -286 -944 211 -250 -852 50 -739 -286 -944 211 -250 -973 -852 -852 -522 -522 1908 50 -739 -286 -944 211 -250
1: 1 10 ミチゴ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: -951 1598 1598 -951 1598 1598 -520 63 -720 -106 317
2: 1 10 ミチゴ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: -368 1807 1807 -368 1807 1807 110 829 634 907 97
3: 1 10 ミチゴ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: -177 1616 1616 -177 1616 1616 121 -250 1173 417 394
4: 1 10 ミチゴ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: -321 1904 1904 -321 1904 1904 -550 911 420 1611 -200
5: 1 10 ミチゴ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 239 1619 1619 239 1619 1619 -737 251 1118 703 687
6: 1 10 ミチゴ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 666 -805 -805 666 -805 -805 1244 570 560 462 -1442
7: 4 10 チゴ(219342) 名詞,普通名詞,一般,*,*,* 5142 5142 3939: 2052 -1145 5211 657 884 432 169 1010 722
8: 4 10 チゴ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1317 35 2704 -520 66 63 -720 -106 317
9: 4 10 チゴ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1862 -386 1662 110 154 829 634 907 97
10: 4 10 チゴ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 837 595 1583 121 437 -250 1173 417 394
11: 4 10 チゴ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 2540 758 1900 -550 395 911 420 1611 -200
12: 4 10 チゴ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1367 -358 1571 -737 547 251 1118 703 687
13: 4 10 チゴ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 1508 992 -880 1244 745 570 560 462 -1442
14: 7 10 ゴ(202390) 名詞,数詞,*,*,*,* 4904 4904 13000: 2864 1110 1995 1111 748 2080 1458 1247 1940 1887 1995 1111 748 2080 1458 1247
15: 7 10 ゴ(202391) 名詞,普通名詞,一般,*,*,* 5142 5142 10761: 3392 2837 657 884 432 169 1010 722 2798 5211 657 884 432 169 1010 722
16: 7 10 ゴ(202392) 名詞,普通名詞,一般,*,*,* 5142 5142 8764: 3392 2837 657 884 432 169 1010 722 2798 5211 657 884 432 169 1010 722
17: 7 10 ゴ(202393) 名詞,普通名詞,一般,*,*,* 5146 5146 6234: 1132 1039 890 684 998 35 570 698 2099 2428 890 684 998 35 570 698
18: 7 10 ゴ(202394) 名詞,普通名詞,一般,*,*,* 5146 5146 8119: 1132 1039 890 684 998 35 570 698 2099 2428 890 684 998 35 570 698
19: 7 10 ゴ(202395) 接頭辞,*,*,*,*,* 5950 5950 8138: 1672 1340 1985 1061 1271 1256 1744 494 2650 1688 1985 1061 1271 1256 1744 494
20: 7 10 ゴ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1193 636 -520 66 63 -720 -106 317 -1016 2704 -520 66 63 -720 -106 317
21: 7 10 ゴ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1738 717 110 154 829 634 907 97 -560 1662 110 154 829 634 907 97
22: 7 10 ゴ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 1262 995 121 437 -250 1173 417 394 -191 1583 121 437 -250 1173 417 394
23: 7 10 ゴ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 1387 1464 -550 395 911 420 1611 -200 -512 1900 -550 395 911 420 1611 -200
24: 7 10 ゴ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1569 468 -737 547 251 1118 703 687 -873 1571 -737 547 251 1118 703 687
25: 7 10 ゴ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 794 1254 1244 745 570 560 462 -1442 1356 -880 1244 745 570 560 462 -1442
26: 1 7 ミチ(255291) 名詞,固有名詞,人名,名,*,* 4789 4789 6820: 2897 2778 2778 2897 2778 2778 987 1222 507 1957 365
27: 1 7 ミチ(255292) 名詞,普通名詞,形状詞可能,*,*,* 5159 5159 3633: 1043 1651 1651 1043 1651 1651 1151 1634 1272 1302 521
28: 1 7 ミチ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: -951 1598 1598 -951 1598 1598 -520 63 -720 -106 317
29: 1 7 ミチ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: -368 1807 1807 -368 1807 1807 110 829 634 907 97
30: 1 7 ミチ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: -177 1616 1616 -177 1616 1616 121 -250 1173 417 394
31: 1 7 ミチ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: -321 1904 1904 -321 1904 1904 -550 911 420 1611 -200
32: 1 7 ミチ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 239 1619 1619 239 1619 1619 -737 251 1118 703 687
33: 1 7 ミチ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 666 -805 -805 666 -805 -805 1244 570 560 462 -1442
34: 4 7 チ(218952) 名詞,普通名詞,一般,*,*,* 5144 5144 4708: 1561 2832 3880 -1429 -285 -112 -280 -316 397
35: 4 7 チ(218953) 記号,一般,*,*,*,* 5977 5977 20000: 492 807 -2936 2391 1722 895 1000 1026 301
36: 4 7 チ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1317 35 2704 -520 66 63 -720 -106 317
37: 4 7 チ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1862 -386 1662 110 154 829 634 907 97
38: 4 7 チ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 837 595 1583 121 437 -250 1173 417 394
39: 4 7 チ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 2540 758 1900 -550 395 911 420 1611 -200
40: 4 7 チ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1367 -358 1571 -737 547 251 1118 703 687
41: 4 7 チ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 1508 992 -880 1244 745 570 560 462 -1442
42: 1 4 ミ(254959) 名詞,数詞,*,*,*,* 4934 4934 13000: 2342 2166 2166 2342 2166 2166 1614 -16 2016 1636 1263
43: 1 4 ミ(254960) 接頭辞,*,*,*,*,* 5953 5953 6459: 2639 1717 1717 2639 1717 1717 1605 437 1120 1793 514
44: 1 4 ミ(254961) 記号,一般,*,*,*,* 5977 5977 20000: 2031 -709 -709 2031 -709 -709 2391 895 1000 1026 301
45: 1 4 ミ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: -951 1598 1598 -951 1598 1598 -520 63 -720 -106 317
46: 1 4 ミ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: -368 1807 1807 -368 1807 1807 110 829 634 907 97
47: 1 4 ミ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: -177 1616 1616 -177 1616 1616 121 -250 1173 417 394
48: 1 4 ミ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: -321 1904 1904 -321 1904 1904 -550 911 420 1611 -200
49: 1 4 ミ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 239 1619 1619 239 1619 1619 -737 251 1118 703 687
50: 1 4 ミ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 666 -805 -805 666 -805 -805 1244 570 560 462 -1442
51: 0 1 A(207) 名詞,普通名詞,助数詞可能,*,*,* 5152 5152 10078: 2104
52: 0 1 A(208) 記号,文字,*,*,*,* 5978 5978 20000: 417
53: 0 1 A(209) 記号,文字,*,*,*,* 5978 5978 20000: 417
54: 0 1 a(5187) 名詞,普通名詞,助数詞可能,*,*,* 5152 5152 8579: 2104
55: 0 1 a(5188) 記号,文字,*,*,*,* 5978 5978 20000: 417
56: 0 1 a(5189) 記号,文字,*,*,*,* 5978 5978 20000: 417
57: 0 1 a(0) 名詞,普通名詞,一般,*,*,* 5139 5139 11633: 893
58: 0 1 a(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13620: 234
59: 0 1 a(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 14228: 709
60: 0 1 a(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 15793: 506
61: 0 1 a(0) 感動詞,一般,*,*,*,* 5687 5687 15246: -640
62: 0 0 (null)(0) BOS/EOS 0 0 0: 0
=== Before rewriting:
0: 0 1 a(5187) 11 5152 5152 8579
1: 1 10 ミチゴ(0) 3 5139 5139 10980
=== After rewriting:
0: 0 1 a(5187) 11 5152 5152 8579
1: 1 10 ミチゴ(0) 3 5139 5139 10980
===
a       名詞,普通名詞,助数詞可能,*,*,*  a
ミチゴ  名詞,普通名詞,一般,*,*,*        ミチゴ
EOS

Case B. bミチゴ

$ echo "bミチゴ" | java -jar target/sudachi-0.4.3.jar -a
b       名詞,普通名詞,一般,*,*,*        b       b               -1      (OOV)
ミチゴ  名詞,普通名詞,一般,*,*,*        ミチゴ  ミチゴ          -1      (OOV)
EOS

Debug output
$ echo "bミチゴ" | java -jar target/sudachi-0.4.3.jar -d

=== Input dump:
bミチゴ
=== Lattice dump:
0: 10 10 (null)(0) BOS/EOS 0 0 0: 50 -739 -286 -944 211 -250 -852 50 -739 -286 -944 211 -250 -973 -852 -852 -522 -522 1908 50 -739 -286 -944 211 -250
1: 1 10 ミチゴ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1598 1598 1598 1598 -520 63 -720 -106 317
2: 1 10 ミチゴ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1807 1807 1807 1807 110 829 634 907 97
3: 1 10 ミチゴ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 1616 1616 1616 1616 121 -250 1173 417 394
4: 1 10 ミチゴ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 1904 1904 1904 1904 -550 911 420 1611 -200
5: 1 10 ミチゴ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1619 1619 1619 1619 -737 251 1118 703 687
6: 1 10 ミチゴ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: -805 -805 -805 -805 1244 570 560 462 -1442
7: 4 10 チゴ(219342) 名詞,普通名詞,一般,*,*,* 5142 5142 3939: 2052 -1145 5211 657 884 432 169 1010 722
8: 4 10 チゴ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1317 35 2704 -520 66 63 -720 -106 317
9: 4 10 チゴ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1862 -386 1662 110 154 829 634 907 97
10: 4 10 チゴ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 837 595 1583 121 437 -250 1173 417 394
11: 4 10 チゴ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 2540 758 1900 -550 395 911 420 1611 -200
12: 4 10 チゴ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1367 -358 1571 -737 547 251 1118 703 687
13: 4 10 チゴ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 1508 992 -880 1244 745 570 560 462 -1442
14: 7 10 ゴ(202390) 名詞,数詞,*,*,*,* 4904 4904 13000: 2864 1110 1995 1111 748 2080 1458 1247 1940 1887 1995 1111 748 2080 1458 1247
15: 7 10 ゴ(202391) 名詞,普通名詞,一般,*,*,* 5142 5142 10761: 3392 2837 657 884 432 169 1010 722 2798 5211 657 884 432 169 1010 722
16: 7 10 ゴ(202392) 名詞,普通名詞,一般,*,*,* 5142 5142 8764: 3392 2837 657 884 432 169 1010 722 2798 5211 657 884 432 169 1010 722
17: 7 10 ゴ(202393) 名詞,普通名詞,一般,*,*,* 5146 5146 6234: 1132 1039 890 684 998 35 570 698 2099 2428 890 684 998 35 570 698
18: 7 10 ゴ(202394) 名詞,普通名詞,一般,*,*,* 5146 5146 8119: 1132 1039 890 684 998 35 570 698 2099 2428 890 684 998 35 570 698
19: 7 10 ゴ(202395) 接頭辞,*,*,*,*,* 5950 5950 8138: 1672 1340 1985 1061 1271 1256 1744 494 2650 1688 1985 1061 1271 1256 1744 494
20: 7 10 ゴ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1193 636 -520 66 63 -720 -106 317 -1016 2704 -520 66 63 -720 -106 317
21: 7 10 ゴ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1738 717 110 154 829 634 907 97 -560 1662 110 154 829 634 907 97
22: 7 10 ゴ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 1262 995 121 437 -250 1173 417 394 -191 1583 121 437 -250 1173 417 394
23: 7 10 ゴ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 1387 1464 -550 395 911 420 1611 -200 -512 1900 -550 395 911 420 1611 -200
24: 7 10 ゴ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1569 468 -737 547 251 1118 703 687 -873 1571 -737 547 251 1118 703 687
25: 7 10 ゴ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 794 1254 1244 745 570 560 462 -1442 1356 -880 1244 745 570 560 462 -1442
26: 1 7 ミチ(255291) 名詞,固有名詞,人名,名,*,* 4789 4789 6820: 2778 2778 2778 2778 987 1222 507 1957 365
27: 1 7 ミチ(255292) 名詞,普通名詞,形状詞可能,*,*,* 5159 5159 3633: 1651 1651 1651 1651 1151 1634 1272 1302 521
28: 1 7 ミチ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1598 1598 1598 1598 -520 63 -720 -106 317
29: 1 7 ミチ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1807 1807 1807 1807 110 829 634 907 97
30: 1 7 ミチ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 1616 1616 1616 1616 121 -250 1173 417 394
31: 1 7 ミチ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 1904 1904 1904 1904 -550 911 420 1611 -200
32: 1 7 ミチ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1619 1619 1619 1619 -737 251 1118 703 687
33: 1 7 ミチ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: -805 -805 -805 -805 1244 570 560 462 -1442
34: 4 7 チ(218952) 名詞,普通名詞,一般,*,*,* 5144 5144 4708: 1561 2832 3880 -1429 -285 -112 -280 -316 397
35: 4 7 チ(218953) 記号,一般,*,*,*,* 5977 5977 20000: 492 807 -2936 2391 1722 895 1000 1026 301
36: 4 7 チ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1317 35 2704 -520 66 63 -720 -106 317
37: 4 7 チ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1862 -386 1662 110 154 829 634 907 97
38: 4 7 チ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 837 595 1583 121 437 -250 1173 417 394
39: 4 7 チ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 2540 758 1900 -550 395 911 420 1611 -200
40: 4 7 チ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1367 -358 1571 -737 547 251 1118 703 687
41: 4 7 チ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: 1508 992 -880 1244 745 570 560 462 -1442
42: 1 4 ミ(254959) 名詞,数詞,*,*,*,* 4934 4934 13000: 2166 2166 2166 2166 1614 -16 2016 1636 1263
43: 1 4 ミ(254960) 接頭辞,*,*,*,*,* 5953 5953 6459: 1717 1717 1717 1717 1605 437 1120 1793 514
44: 1 4 ミ(254961) 記号,一般,*,*,*,* 5977 5977 20000: -709 -709 -709 -709 2391 895 1000 1026 301
45: 1 4 ミ(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 1598 1598 1598 1598 -520 63 -720 -106 317
46: 1 4 ミ(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 14802: 1807 1807 1807 1807 110 829 634 907 97
47: 1 4 ミ(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13451: 1616 1616 1616 1616 121 -250 1173 417 394
48: 1 4 ミ(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 13759: 1904 1904 1904 1904 -550 911 420 1611 -200
49: 1 4 ミ(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 14554: 1619 1619 1619 1619 -737 251 1118 703 687
50: 1 4 ミ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: -805 -805 -805 -805 1244 570 560 462 -1442
51: 0 1 B(573) 記号,文字,*,*,*,* 5978 5978 20000: 417
52: 0 1 B(574) 記号,文字,*,*,*,* 5978 5978 20000: 417
53: 0 1 b(5244) 記号,文字,*,*,*,* 5978 5978 20000: 417
54: 0 1 b(5245) 記号,文字,*,*,*,* 5978 5978 20000: 417
55: 0 1 b(0) 名詞,普通名詞,一般,*,*,* 5139 5139 11633: 893
56: 0 1 b(0) 名詞,固有名詞,一般,*,*,* 4785 4785 13620: 234
57: 0 1 b(0) 名詞,固有名詞,人名,一般,*,* 4787 4787 14228: 709
58: 0 1 b(0) 名詞,固有名詞,地名,一般,*,* 4791 4791 15793: 506
59: 0 1 b(0) 感動詞,一般,*,*,*,* 5687 5687 15246: -640
60: 0 0 (null)(0) BOS/EOS 0 0 0: 0
=== Before rewriting:
0: 0 1 b(0) 3 5139 5139 11633
1: 1 4 ミ(254960) 67 5953 5953 6459
2: 4 10 チゴ(219342) 3 5142 5142 3939
=== After rewriting:
0: 0 1 b(0) 3 5139 5139 11633
1: 1 10 ミチゴ(0) 3 5139 5139 10980
===
b       名詞,普通名詞,一般,*,*,*        b
ミチゴ  名詞,普通名詞,一般,*,*,*        ミチゴ
EOS

In TestSudachiSplitFilter, there is a test case for OOV handling in extended mode:

"アマゾンに行った。" => { "アマゾン", "ア", "マ", "ゾ", "ン", "に", "行っ", "た" }

    @Test
    public void testWithOOVByExtendedMode() throws IOException {
        tokenStream = setUpTokenStream("extended", "アマゾンに行った。");
        assertTokenStreamContents(tokenStream,
                                  new String[] { "アマゾン", "ア", "マ", "ゾ", "ン", "に", "行っ", "た" },
                                  new int[] { 0, 0, 1, 2, 3, 4, 5, 7 },
                                  new int[] { 4, 1, 2, 3, 4, 5, 7, 8 },
                                  new int[] { 1, 0, 1, 1, 1, 1, 1, 1 },
                                  new int[] { 4, 1, 1, 1, 1, 1, 1, 1 },
                                  9);
    }

However, when I tried a slightly different text, the behavior was different:

"アマゾン" => { "ア", "ア", "マ", "マ", "ゾ", "ゾ", "ン", "ン" }

For example, the following test case currently passes:

    @Test
    public void testWithOOVByExtendedMode2() throws IOException {
        tokenStream = setUpTokenStream("extended", "アマゾン");
        assertTokenStreamContents(tokenStream,
                                  new String[] { "ア", "ア", "マ", "マ", "ゾ", "ゾ", "ン", "ン" },
                                  new int[] { 0, 0, 1, 1, 2, 2, 3, 3 },
                                  new int[] { 1, 1, 2, 2, 3, 3, 4, 4 },
                                  new int[] { 1, 0, 1, 0, 1, 0, 1, 0 },
                                  new int[] { 1, 1, 1, 1, 1, 1, 1, 1 },
                                  4);
    }
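
To compare these test arrays with the `position` values in the `_analyze` responses above, it may help to turn position increments into absolute positions. The following is a minimal sketch, assuming that the reported position is the running sum of the increments starting from -1; the class and method names are hypothetical and are not part of the plugin or of Lucene.

```java
import java.util.Arrays;

// Illustrative sketch only: PositionSketch and toPositions are hypothetical names.
final class PositionSketch {

    /** Converts position increments into absolute, zero-based positions. */
    static int[] toPositions(int[] posIncrements) {
        int[] positions = new int[posIncrements.length];
        int current = -1; // assumption: a first increment of 1 maps to position 0
        for (int i = 0; i < posIncrements.length; i++) {
            current += posIncrements[i];
            positions[i] = current;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Increments from testWithOOVByExtendedMode
        System.out.println(Arrays.toString(toPositions(new int[] {1, 0, 1, 1, 1, 1, 1, 1})));
        // Increments from testWithOOVByExtendedMode2
        System.out.println(Arrays.toString(toPositions(new int[] {1, 0, 1, 0, 1, 0, 1, 0})));
    }
}
```

Under that assumption, the second test places every character twice at the same position, which is the duplication this issue is about.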

I believe the expected output is { "アマゾン", "ア", "マ", "ゾ", "ン" } (or, since the test setup has no JoinKatakanaOovPlugin and only the simple JoinOovPlugin, perhaps { "ア", "マ", "ゾ", "ン" }?).

I think the problem (or at least one of the problems) is that, in extended mode, the per-character tokens for an OOV word are produced even when the OOV word is a single character.
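
The behavior I would expect can be sketched as follows. This is only an illustration under my own assumptions, not the plugin's code and not the change made in any PR: `ExtendedSplitSketch`, `Tok`, and `expandOov` are hypothetical names. The idea is to emit the whole OOV token first and add per-character tokens only when the OOV word is longer than one character.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hypothetical names, not the plugin's classes.
final class ExtendedSplitSketch {

    /** Minimal token holder: text, char offsets, position increment, position length. */
    static final class Tok {
        final String text;
        final int start;
        final int end;
        final int posInc;
        final int posLen;

        Tok(String text, int start, int end, int posInc, int posLen) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
            this.posLen = posLen;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ") inc=" + posInc + " len=" + posLen;
        }
    }

    /** Expands one OOV word into the whole token plus, if longer than one character, its characters. */
    static List<Tok> expandOov(String surface, int startOffset) {
        List<Tok> out = new ArrayList<>();
        int count = surface.codePointCount(0, surface.length());
        // The whole word comes first and spans all of its characters.
        out.add(new Tok(surface, startOffset, startOffset + surface.length(), 1, count));
        if (count <= 1) {
            return out; // guard: a single-character OOV word must not be emitted twice
        }
        boolean first = true;
        for (int i = 0; i < surface.length(); ) {
            int len = Character.charCount(surface.codePointAt(i));
            // The first character shares the position of the whole word (increment 0).
            out.add(new Tok(surface.substring(i, i + len),
                            startOffset + i, startOffset + i + len,
                            first ? 0 : 1, 1));
            first = false;
            i += len;
        }
        return out;
    }

    public static void main(String[] args) {
        // Single-character OOV word: only one token is expected.
        System.out.println(expandOov("b", 0));
        // Three katakana characters (mi-chi-go), written as escapes to keep the source ASCII-only.
        System.out.println(expandOov("\u30DF\u30C1\u30B4", 1));
    }
}
```

With this guard, `b` yields a single token, while the three-character katakana word yields the whole word followed by its three characters, with start offsets that never decrease.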

This could be a critical problem, because it causes errors when indexing documents that contain such characters, e.g.:

"error": {
  "type": "illegal_argument_exception",
  "reason": "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=67,endOffset=68,lastStartOffset=77 for field 'body'"
}
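
The condition named in this error message can be checked against a token list with a few lines of code. The sketch below only mirrors the wording of the message (non-negative start, end not before start, start offsets not going backwards); it is not Lucene's implementation, and `OffsetCheckSketch` and `firstViolation` are hypothetical names.

```java
// Illustrative sketch only: mirrors the wording of the error message, not Lucene's code.
final class OffsetCheckSketch {

    /** Returns the index of the first token that breaks the offset rules, or -1 if none does. */
    static int firstViolation(int[] startOffsets, int[] endOffsets) {
        int lastStart = 0;
        for (int i = 0; i < startOffsets.length; i++) {
            int start = startOffsets[i];
            int end = endOffsets[i];
            // start must be non-negative, end >= start, and start must not go backwards
            if (start < 0 || end < start || start < lastStart) {
                return i;
            }
            lastStart = start;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Offsets taken from the second-and-later response for the b + katakana input above.
        int[] starts = {0, 0, 1, 2, 1, 1, 2, 3};
        int[] ends   = {1, 1, 2, 3, 4, 2, 3, 4};
        System.out.println(firstViolation(starts, ends));
    }
}
```

For the offsets of the second-and-later response above, this returns 4: the whole-word token starts at offset 1 after a token that started at offset 2, which is the kind of backwards offset the error message complains about.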

I believe the above error will be fixed by the above PR.

I've fixed this in #64. Can you try it?

It works 🎉

With Elasticsearch v7.7.0 and SudachiSplitFilter in extended mode, the analysis result is fine:

  • aミチゴ -> a / ミチゴ / ミ / チ / ゴ
  • bミチゴ -> b / ミチゴ / ミ / チ / ゴ

I have also confirmed that the "startOffset must be non-negative, and endOffset must be >= startOffset, ..." errors are gone when indexing previously problematic documents.

Thank you!