JoinKatakana plugin behaves differently from the Java version
kazuma-t opened this issue · 1 comment
kazuma-t commented
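In SudachiPy, the JoinKatakana plugin always creates an OOV node when it concatenates nodes in concatenate_oov(). The Java version instead calls Lattice#getMinimumNode() and, if the lattice already contains nodes covering the same range, returns the one with the lowest cost. In the dumps below, the Java version rewrites オ + バケ to the dictionary entry オバケ (normalized form お化け), while SudachiPy rewrites it to an OOV node with word ID 0 (normalized form オバケ).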
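The difference can be sketched in a few lines of Python. This is only an illustration of the lookup-then-fallback behavior described above, not the actual Sudachi or SudachiPy code: the `Node` class, its fields, and the standalone `get_minimum_node()` and `concatenate_oov()` functions here are hypothetical stand-ins, and the real signatures may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical, simplified node; the real lattice node carries more fields.
    begin: int
    end: int
    surface: str
    word_id: int
    cost: int
    is_oov: bool = False


def get_minimum_node(candidates: List[Node], begin: int, end: int) -> Optional[Node]:
    """Return the lowest-cost candidate spanning exactly begin..end, or None."""
    in_range = [n for n in candidates if n.begin == begin and n.end == end]
    return min(in_range, key=lambda n: n.cost, default=None)


def concatenate_oov(path: List[Node], candidates: List[Node]) -> Node:
    """Join consecutive nodes, preferring a node that already covers the range."""
    begin, end = path[0].begin, path[-1].end
    existing = get_minimum_node(candidates, begin, end)
    if existing is not None:
        # Behavior attributed to the Java version: reuse the dictionary node.
        return existing
    # Fallback (what SudachiPy currently always does): build a fresh OOV node.
    surface = "".join(n.surface for n in path)
    return Node(begin, end, surface, word_id=0, cost=0, is_oov=True)
```

With the nodes from the dumps below (オ at 0-3, バケ at 3-9, and a dictionary node オバケ at 0-9), this sketch returns the dictionary node rather than a new OOV node. It creates an OOV node only when no node covers the joined range.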
Sudachi (Java version)
=== Input dump:
オバケ
=== Lattice dump:
0: 9 9 (null)(0) BOS/EOS 0 0 0: 50 50 -739 -286 -944 211 -250 -163 -205 -852 -852 50 -739 -286 -944 211 -250 -852 -852 -955 50 -739 -286 -944 211 -250
1: 0 9 オバケ(816334) 名詞,普通名詞,一般,*,*,* 5139 5139 10000: 893
...
51: 0 3 オ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: -640
52: 0 0 (null)(0) BOS/EOS 0 0 0: 0
=== Before rewriting:
0: 0 3 オ(185851) 67 5946 5946 5621
1: 3 9 バケ(233719) 3 5142 5142 3446
=== After rewriting:
0: 0 9 オバケ(816334) 3 5139 5139 10000
===
オバケ 名詞,普通名詞,一般,*,*,* お化け
EOS
SudachiPy
=== Inupt dump:
オバケ
=== Lattice dump:
1: 9 9 (null)(0) BOS/EOS 0 0 0: 50 50 -739 -286 -944 211 -250 -163 -205 -852 -852 50 -739 -286 -944 211 -250 -852 -852 -955 50 -739 -286 -944 211 -250
2: 0 9 オバケ(816309) 名詞,普通名詞,一般,*,*,* 5139 5139 10000: 893
...
41: 0 0 (null)(0) BOS/EOS 0 0 0: 0
=== Before Rewriting:
0: 0 3 オ(185851) 5946 5946 5621
1: 3 9 バケ(233719) 5142 5142 3446
=== After Rewriting:
0: 0 9 オバケ(0) 0 0 0
===
オバケ 名詞,普通名詞,一般,*,*,* オバケ
EOS