WorksApplications/sudachi.rs

If a dictionary contains U+30FC hyphens(ー), it is not registered in the user dictionary.

Dormir30 opened this issue · 23 comments

If I register a word containing a double-byte hyphen in the user dictionary CSV, the registration has no effect and the word is still split during analysis. Words that do not contain double-byte hyphens are registered and analyzed correctly.
The procedure is as follows.

  1. Add the word to the CSV file.
  2. Update the dictionary in sudachipy.
  3. Run: !sudachipy ubuild -s '{site_package}/sudachidict_core/resources/system.dic' ./.sudachi/user_dic.csv -o ./.sudachi/user.dic
  4. Change the "userDict" path in sudachi.json.
  5. Save the JSON file as UTF-8.
  6. After this, the word is still not recognized, for example when I run the following on the command line.

echo "ビューティーキラー" | sudachipy

My result is below:
ビューティー 名詞,普通名詞,一般,*,*,* ビューティー
キラー 名詞,普通名詞,一般,*,*,* キラー
EOS

But I want to see below:
ビューティーキラー 名詞,普通名詞,一般,*,*,* ビューティーキラー
EOS

Words that do not contain such "-" are reflected in the dictionary.
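
Step 4 of the procedure (pointing sudachi.json at the compiled user dictionary) could be scripted roughly as follows. This is only a sketch: the helper name `set_user_dict` is hypothetical, and it assumes the configuration file is plain JSON and that "userDict" holds a list of paths.

```python
import json
from pathlib import Path


def set_user_dict(config_path, user_dic_path):
    """Point the "userDict" entry of a sudachi.json file at one user dictionary."""
    config_file = Path(config_path)
    # Read and write explicitly as UTF-8 so the result does not depend on
    # the Windows code page.
    config = json.loads(config_file.read_text(encoding="utf-8"))
    # Assumption: "userDict" is a list of compiled user dictionary paths.
    config["userDict"] = [str(user_dic_path)]
    config_file.write_text(
        json.dumps(config, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return config
```

Calling `set_user_dict("sudachi.json", "./.sudachi/user.dic")` would then replace any existing "userDict" entry and leave the other settings untouched.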

My environment:

Operating System: Windows 10 Pro
Python Version Used: Python 3.10.5
spaCy Version Used: spacy 3.5.0
Environment Information: Intel(R) Core(TM) i7-7700 CPU
Other library info:
spacy-alignments 0.9.0
spacy-legacy 3.0.11
spacy-loggers 1.0.4
spacy-transformers 1.1.9
SudachiDict-core 20221021
SudachiPy 0.6.6
SudachiTra 0.1.7

Sudachi performs NFKC Unicode normalization internally before the analysis, so dictionary words that are not in NFKC form cannot be found at all. They are registered, but they can never be used in the analysis.

Adding warnings for such words during dictionary compilation is on the to-do list.
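
Until such a warning exists, a headword can be checked by hand with Python's standard `unicodedata` module. This is a rough sketch: `nfkc_report` is a hypothetical helper, and Sudachi's own input normalization may differ from plain NFKC in its details, so treat the result as an approximation.

```python
import unicodedata


def nfkc_report(word):
    """Return (normalized, changed) for a dictionary headword under plain NFKC."""
    normalized = unicodedata.normalize("NFKC", word)
    return normalized, normalized != word


samples = [
    "ビューティーキラー",      # full-width katakana with U+30FC prolonged sound marks
    "ﾋﾞｭｰﾃｨｰｷﾗｰ",          # half-width katakana with U+FF70
    "ビュ－ティ－キラ－",      # U+FF0D fullwidth hyphen-minus instead of U+30FC
]
for word in samples:
    normalized, changed = nfkc_report(word)
    # changed=True means the headword as written is not in NFKC form.
    print(f"{word!r} -> {normalized!r} changed={changed}")
```

A headword reported as changed is not in NFKC form as written; under this approximation, registering the normalized form instead is the safer choice.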

Thanks for the reply.
So is this word basically not possible?
For example.

"ビュ-ティ-キラ-"

Is there any other hyphen-like person for U+30FC hyphen that can be recognized as a word? For example, "-", "~", etc.

hyphen-like person
Do you mean character?

By default we use a plugin which replaces several hyphen-like characters with katakana prolongation marks.
https://github.com/WorksApplications/sudachi.rs/blob/develop/resources/sudachi.json#L7
It is possible to add other hyphens here.

A word registered with ASCII hyphens should match full-width hyphens if those are normalized into half-width ones (I am not sure whether they are; you should check). But I agree with you that mixed usage of hyphens, especially in place of U+30FC, is one of the nasty things about user-generated Japanese text.
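
As a stand-alone illustration of the idea (this is not Sudachi's plugin, and both the character set and the function name are hypothetical), text could be pre-processed so that a hyphen-like character directly after a katakana letter becomes U+30FC:

```python
import re

# Hypothetical set of hyphen-like characters to treat as a prolonged sound mark:
# ASCII hyphen-minus, U+2010, U+2011, U+2013, U+2014, U+2015, U+2212, U+FF70, U+FF0D.
HYPHEN_LIKE = "-\u2010\u2011\u2013\u2014\u2015\u2212\uFF70\uFF0D"

# Match one hyphen-like character that directly follows a katakana letter
# (U+30A1 to U+30FA) or an existing prolonged sound mark (U+30FC).
_PATTERN = re.compile(
    "(?<=[\u30A1-\u30FA\u30FC])[" + re.escape(HYPHEN_LIKE) + "]"
)


def to_prolonged_sound_mark(text):
    """Replace hyphen-like characters that follow katakana with U+30FC."""
    return _PATTERN.sub("\u30FC", text)


print(to_prolonged_sound_mark("ビュ-ティ-キラ-"))
```

Hyphens that do not follow katakana, as in "A-B", are left alone, which keeps ordinary Latin text intact. Runs of several hyphens in a row are deliberately not handled in this sketch.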

Since U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK) is not changed by NFKC normalization, I think this issue may be related to the word cost. I tried a user dictionary created with SudachiPy in the Java version, and the dictionary was loaded properly. Perhaps the word cost given in the documentation is too high, which is why the word does not appear in the analysis results.

user_dict.csv:

ビューティーキラー,5146,5146,8000,ビューティーキラー,名詞,普通名詞,一般,*,*,*,ビューティーキラー,ビューティーキラー,*,*,*,*,*
echo ビューティーキラー | java -jar sudachi-0.6.2.jar -s '{"systemDict":"./system_core.dic","userDict":["./user_dict.dic"]}' -d
=== Input dump:
ビューティーキラー
=== Lattice dump:
0: 27 27 (null)(0) BOS/EOS 0 0 0: -522 50 -739 -286 -944 211 -250 50 -739 -286 -944 211 -250 -1143 -238 50 -739 -286 -944 211 -250 -593 101 50 -739 -286 -944 211 -250 -447
1: 0 27 ビューティーキラー(268435456) 名詞,数詞,*,*,*,* 5146 5146 8000: 447
2: 0 27 ビューティーキラー(0) 名詞,普通名詞,一般,*,*,* 5139 5139 10980: 893
...
101: 0 0 (null)(0) BOS/EOS 0 0 0: 0
=== Before rewriting:
0: 0 18 ビューティー(238854) 3 5144 5144 2034
1: 18 27 キラー(194120) 3 5144 5144 1326
=== After rewriting:
0: 0 18 ビューティー(238854) 3 5144 5144 2034
1: 18 27 キラー(194120) 3 5144 5144 1326
===
ビューティー    名詞,普通名詞,一般,*,*,*        ビューティー
キラー  名詞,普通名詞,一般,*,*,*        キラー
EOS

The system dictionaries used in the Java and Python versions are different, so the part-of-speech names may be different, but let's ignore that for now.

I tried running it with java on the command line, but got the following error. How do I run it?
I did some research on the error but could not find anything.

C:\Users\AAAAA\Desktop\code>echo ビューティーキラー | java -jar sudachi-0.6.2.jar -s '{"systemDict":"./system_core.dic","userDict":["./user.dic"]}' -d

Exception in thread "main" java.lang.NoClassDefFoundError: javax/json/stream/JsonParsingException
    at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:197)
Caused by: java.lang.ClassNotFoundException: javax.json.stream.JsonParsingException
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 1 more

Please extract the release package (https://github.com/WorksApplications/Sudachi/releases/download/v0.6.3/sudachi-0.6.3-executable.zip) and place the included libraries (javax.json-1.1.4.jar and jdartsclone-1.2.0.jar) in the same folder before executing.

$ ls
LICENSE-2.0.txt  javax.json-1.1.4.jar   licenses           sudachi.json     user_dict.dic
README.md        jdartsclone-1.2.0.jar  sudachi-0.6.3.jar  system_core.dic

Thank you for the additional information. I placed the extracted files in the current directory and ran the command, but I still get the same error.
(system_core.dic was not included in "sudachi-0.6.3-executable", so I used 0.6.2.)

Instead, try the following.

java -cp sudachi-0.6.3.jar;javax.json-1.1.4.jar;jdartsclone-1.2.0.jar com.worksap.nlp.sudachi.SudachiCommandLine -d

As additional information, I tried switching the character encoding in cmd between utf-8 and cp932, but both failed.

If you are running it in CMD.EXE, try PowerShell instead; the JSON quoting may be broken.

I tried to run it in Powershell, but it did not work with the following error.

PS C:\Users\XXXXX\Desktop\code> java -cp ".\sudachi-0.6.3.jar";".\javax.json-1.1.4.jar";".\jdartsclone-1.2.0.jar" com.worksap.nlp.sudachi.SudachiCommandLine -d
At line:1 char:83
+ ... \jdartsclone-1.2.0.jar" com.worksap.nlp.sudachi.SudachiCommandLine -d
+                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unexpected token 'com.worksap.nlp.sudachi.SudachiCommandLine' in expression or statement.
    + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
    + FullyQualifiedErrorId : UnexpectedToken

PS C:\Users\XXXXX\Desktop\code> echo ビューティーキラー | java -jar sudachi-0.6.3.jar -s '{"systemDict":" C:/Users/XXXXX/Desktop/code/system_core.dic","userDict":["C:/Users/XXXXX/Desktop/code/user.dic"]}' -d
Exception in thread "main" java.lang.IllegalArgumentException: javax.json.stream.JsonParsingException: Unexpected char 115 at (line no=1, column no=2, offset=1)
at com.worksap.nlp.sudachi.Settings.parse(Settings.java:207)
at com.worksap.nlp.sudachi.Config.fromJsonString(Config.java:193)
at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:225)
Caused by: javax.json.stream.JsonParsingException: Unexpected char 115 at (line no=1, column no=2, offset=1)
at org.glassfish.json.JsonTokenizer.unexpectedChar(JsonTokenizer.java:601)
at org.glassfish.json.JsonTokenizer.nextToken(JsonTokenizer.java:418)
at org.glassfish.json.JsonParserImpl$ObjectContext.getNextEvent(JsonParserImpl.java:466)
at org.glassfish.json.JsonParserImpl.next(JsonParserImpl.java:376)
at org.glassfish.json.JsonParserImpl.getObject(JsonParserImpl.java:335)
at org.glassfish.json.JsonParserImpl.getObject(JsonParserImpl.java:173)
at org.glassfish.json.JsonReaderImpl.read(JsonReaderImpl.java:94)
at com.worksap.nlp.sudachi.Settings.parse(Settings.java:187)
... 2 more

It doesn't seem to work with older PowerShell. You can run it in CMD.EXE by quoting as follows.

java -jar sudachi-0.6.3.jar -s {\"systemDict\":\"c:\\path\\to\\system_core.dic\",\"userDict\":[\"c:\\path\\to\\user.dic\"]} -d

I set the cost to 5000 and now the word in the user dictionary appears.

ビューティーキラー,5146,5146,5000,ビューティーキラー,名詞,普通名詞,一般,*,*,*,ビューティーキラー,ビューティーキラー,*,*,*,*,*
>java -jar sudachi-0.6.3.jar -s {\"systemDict\":\"c:\\path\\to\\system_core.dic\",\"userDict\":[\"c:\\path\\to\\user.dic\"]} bk.txt
ビューティーキラー      名詞,普通名詞,一般,*,*,*        ビューティーキラー
EOS
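
Lowering the cost by hand works for a single entry. For a larger CSV, a small script along these lines could do it. This is only a sketch: `set_cost` is a hypothetical helper, and it assumes the column order used in the rows above (surface, left connection ID, right connection ID, cost, and so on).

```python
import csv
import io

# Columns 0-3 in the rows above: surface, left connection ID,
# right connection ID, cost.
COST_COLUMN = 3


def set_cost(csv_text, surface, new_cost):
    """Return csv_text with the cost replaced in every row whose surface matches."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0] == surface:
            row[COST_COLUMN] = str(new_cost)
        writer.writerow(row)
    return out.getvalue()


row = ("ビューティーキラー,5146,5146,8000,ビューティーキラー,名詞,普通名詞,一般,"
       "*,*,*,ビューティーキラー,ビューティーキラー,*,*,*,*,*\n")
print(set_cost(row, "ビューティーキラー", 5000), end="")
```

After changing the cost, the CSV has to be compiled into a user dictionary again before the new value takes effect.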

I ran it as follows, but I get an error saying that an argument is missing from the parameter list. Does this mean that I am using a different version of something?

■Code

echo ビューティーキラー | java -jar sudachi-0.6.3.jar -s {"systemDict":"c:\Users\XXXXX\Desktop\code\system_core.dic","userDict":["c:\Users\XXXXX\Desktop\code\user.dic"]} -d

■Result

PS C:\Users\XXXXX\Desktop\code> echo ビューティーキラー | java -jar sudachi-0.6.3.jar -s {"systemDict":"c:\Users\XXXXX\Desktop\code\system_core.dic","userDict":["c:\Users\XXXXX\Desktop\code\user.dic"]} -d

At line:1 char:119
+ ... ict":"c:\Users\XXXXX\Desktop\code\system_core.dic","userDi ...
+                                                             ~
Missing argument in parameter list.
    + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
    + FullyQualifiedErrorId : MissingArgument

This is not a Sudachi issue but a CMD.EXE or PowerShell quoting issue, so please check their manuals for more details.
Double quotes (") must be escaped with a backslash (\). Backslashes must also be escaped with a backslash.
In addition, PowerShell requires the entire argument to be enclosed in single quotes (').

CMD.EXE:

java -jar sudachi-0.6.3.jar -s {\"systemDict\":\"c:\\path\\to\\system_core.dic\",\"userDict\":[\"c:\\path\\to\\user.dic\"]}

PowerShell:

java -jar sudachi-0.6.3.jar -s '{\"systemDict\":\"c:\\path\\to\\system_core.dic\",\"userDict\":[\"c:\\path\\to\\user.dic\"]}'
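
One way to avoid shell quoting altogether is to start Java from a script that passes the arguments as a list. The sketch below is not part of Sudachi: `build_sudachi_command` is a hypothetical helper and the paths are the same placeholders as above. It relies on `json.dumps` to escape the quotes and backslashes inside the settings string.

```python
import json


def build_sudachi_command(jar, system_dict, user_dicts, input_file=None):
    """Build an argument list for the Sudachi command line tool."""
    # json.dumps escapes double quotes and backslashes, so Windows paths
    # can be given as plain strings.
    settings = json.dumps({"systemDict": system_dict, "userDict": list(user_dicts)})
    command = ["java", "-jar", jar, "-s", settings]
    if input_file is not None:
        command.append(input_file)
    return command


command = build_sudachi_command(
    "sudachi-0.6.3.jar",
    r"c:\path\to\system_core.dic",
    [r"c:\path\to\user.dic"],
    "bk.txt",
)
print(command)

# To actually run it (requires Java and the jar files in the current directory):
# import subprocess
# subprocess.run(command, check=True)
```

Because the command is a list rather than a single string, no shell parses the JSON, so the CMD.EXE and PowerShell quoting rules do not apply.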

Thank you. The command now runs, but the output appears to be garbled: the input is shown as "?????????". I tried changing the character encoding, but I still could not get a correct result. The following are the results.

PS C:\Users\XXXXX\Desktop\code> echo ビューティーキラー | java -jar sudachi-0.6.3.jar -s '{"systemDict":"c:\Users\XXXXX\Desktop\code\system_core.dic","userDict":["c:\Users\XXXXX\Desktop\code\user.dic"]}' -d
=== Input dump:
?????????
=== Lattice dump:
0: 9 9 (null)(0) BOS/EOS 0 0 0: -739 -739 -739 -739 -739 -739 -739 -447 -739 -2139 -739
1: 0 9 ?????????(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 17094: 490
2: 1 9 ????????(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 17094: 585
3: 2 9 ???????(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 17094: 95 585
4: 3 9 ??????(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 17094: 95 585
5: 4 9 ?????(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 17094: 95 585
6: 5 9 ????(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 17094: 95 585
7: 6 9 ???(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 17094: 95 585
8: 7 9 ?(6593) 補助記号,一般,*,*,*,* 5969 5969 2588: -1181 1516
9: 7 9 ??(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 17094: 95 585
10: 8 9 ?(228) 補助記号,句点,*,*,*,* 5970 5970 611: -1143 -678
11: 8 9 ?(0) 名詞,普通名詞,サ変可能,*,*,* 5129 5129 17094: 95 585
12: 6 8 ?(6593) 補助記号,一般,*,*,*,* 5969 5969 2588: -1181 1516
13: 7 8 ?(228) 補助記号,句点,*,*,*,* 5970 5970 611: -1143 -678
14: 5 7 ?(6593) 補助記号,一般,*,*,*,* 5969 5969 2588: -1181 1516
15: 6 7 ?(228) 補助記号,句点,*,*,*,* 5970 5970 611: -1143 -678
16: 4 6 ?(6593) 補助記号,一般,*,*,*,* 5969 5969 2588: -1181 1516
17: 5 6 ?(228) 補助記号,句点,*,*,*,* 5970 5970 611: -1143 -678
18: 3 5 ?(6593) 補助記号,一般,*,*,*,* 5969 5969 2588: -1181 1516
19: 4 5 ?(228) 補助記号,句点,*,*,*,* 5970 5970 611: -1143 -678
20: 2 4 ?(6593) 補助記号,一般,*,*,*,* 5969 5969 2588: -1181 1516
21: 3 4 ?(228) 補助記号,句点,*,*,*,* 5970 5970 611: -1143 -678
22: 1 3 ?(6593) 補助記号,一般,*,*,*,* 5969 5969 2588: 1516
23: 2 3 ?(228) 補助記号,句点,*,*,*,* 5970 5970 611: -1143 -678
24: 0 2 ?(6593) 補助記号,一般,*,*,*,* 5969 5969 2588: 975
25: 1 2 ?(228) 補助記号,句点,*,*,*,* 5970 5970 611: -678
26: 0 1 ?(228) 補助記号,句点,*,*,*,* 5970 5970 611: -262
27: 0 0 (null)(0) BOS/EOS 0 0 0: 0

It seems that the character encoding is not being handled correctly; handling character encodings properly in the Windows console is difficult.
Writing the input to a file and parsing that file seems to work.

If you are in a Japanese Windows environment, please create the file in Shift-JIS.

Thank you for your reply. I would like to try this with an input file. Do you have a sample command line (CLI) that reads from an input file?

>type bk.txt
ビューティーキラー

>java -jar sudachi-0.6.3.jar -s {\"systemDict\":\"c:\\path\\to\\system_core.dic\",\"userDict\":[\"c:\\path\\to\\user.dic\"]} bk.txt
ビューティーキラー      名詞,普通名詞,一般,*,*,*        ビューティーキラー
EOS
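
An input file like bk.txt could also be produced from Python, which makes the encoding explicit. This is a sketch under two assumptions: the Java runtime reads the file with the platform default encoding, and on Japanese Windows that default is the Shift-JIS variant Python calls "cp932". The helper name `write_input_file` is hypothetical.

```python
from pathlib import Path


def write_input_file(path, text, encoding="cp932"):
    """Write one line of input text in the given encoding (Shift-JIS by default)."""
    # Assumption: the analyzer reads the file in the platform default encoding,
    # which is Shift-JIS (cp932) on Japanese Windows.
    Path(path).write_text(text + "\n", encoding=encoding)


write_input_file("bk.txt", "ビューティーキラー")
```

On a system whose default encoding is UTF-8, passing `encoding="utf-8"` instead would be the matching choice.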

Thank you. I was able to run it successfully, but it does not seem to be recognised from the dictionary.

■Result
PS C:\Users\XXXXX\Desktop\code> java -jar sudachi-0.6.3.jar -s '{"systemDict":"c:\Users\XXXXX\Desktop\code\system_core.dic","userDict":["c:\Users\XXXXX\Desktop\code\user.dic"]}' bk.txt

ビューティー 名詞,普通名詞,一般,*,*,* ビューティー
キラー 名詞,普通名詞,一般,*,*,* キラー
EOS

Maybe the cost of the word is too high.

#229 (comment)

Thank you very much. The word is now analyzed as I wanted, so I will close this issue. I apologize for the long thread.