There is nothing better than better documentation
ikawaha opened this issue · 14 comments
KEINOS Thank you very much!
Maybe this is obvious stuff and one is expected to know this, but I think it would be nice to include something like your comment in the README.
Originally posted by @CaptainDario in #274 (comment)
CaptainDario Indeed. There is nothing better than better documentation!
ikawaha, if the above explanation is OK, I would like to open a PR for it. Where should I write it? In the Wiki, maybe?
@KEINOS Thanks for the suggestion.
Would it be better to put the details in the wiki and link to it from the README? The wiki of this repository is open and you are free to add to it.
@ikawaha (cc: @CaptainDario)
> The wiki of this repository is open and you are free to add to it.
Thank you!
> Would it be better to put the details in the wiki and link to it from the README?
I agree. I would like to start with the "keywords per page" of the Wiki. For example, start with "wakati". We should think more about this when we have more keywords, shouldn't we?
I have no idea 😇 , so let's start with "wakati".
There is extensive documentation on janome, which may be helpful.
@ikawaha (cc: @CaptainDario)
I have finally started editing the Wiki. But I think it is premature to link from the README.md, as I am just copying and pasting the issues.
Ideally, we would like to translate the official Japanese documentation into English. However, for the time being, it would be realistic to add topics one by one to the Wiki and later create a separate repository for kagome-doc.
Also, I was looking at the official documentation and thought that enriching ExampleXXX and godoc would be a Golang approach.
Thank you so much!
It's great! 🙏
Even at this point, we have a few Example tests, and it's a great idea to enrich ExampleXXX and godoc.
( '-`).oO( But, they may not work with go-playground because the build timed out 😇.
e.g. Example test for the word filter
Lines 167 to 187 in a16f933
> But, they may not work with go-playground because the build timed out 😇.
Yes, indeed. Go Playground is a no-go for kagome for now 😭
However, as long as godoc can run ExampleXXX, it is worth including whenever possible.
How about creating an _example directory and putting some working examples there? Along with Wiki and godoc improvements, of course.
- Example @ go-sqlite3
> How about creating an _example directory and putting some working examples there? Along with Wiki and godoc improvements, of course.
It sounds good 👍.
I created ./sample/_exmple folder for adding working examples in PR #296.
@KEINOS I am currently playing around with the different dictionaries. While doing this I figured out that, when using unidic processing:
私は日本人です。
Results in
[代名詞, *, *, *, *, *, ワタクシ, 私-代名詞, 私, ワタクシ, 私, ワタクシ, 和, *, *, *, *],
[助詞-係助詞, 係助詞, *, *, *, *, ハ, は, は, , は, ワ, 和, *, *, *, *],
[名詞-固有名詞-地名-国, 固有名詞, 地名, 国, *, *, ニッポン, 日本, 日本, ニッポン, 日本, ニッポン, 固, *, *, *, *],
[接尾辞-名詞的-一般, 名詞的, 一般, *, *, *, ニン, 人, 人, ニン, 人, ニン, 漢, *, *, *, *],
[助動詞, *, *, *, 助動詞-デス, 終止形-一般, デス, です, です, , です, デス, 和, *, *, *, *],
[補助記号-句点, 句点, *, *, *, *, , 。, 。, , 。, , 記号, *, *, *, *]
Notice: ワタクシ, ニン
However, when running with ipadic the result is
[名詞, 代名詞, 一般, *, *, *, 私, ワタシ, ワタシ],
[助詞, 係助詞, *, *, *, *, は, ハ, ワ],
[名詞, 一般, *, *, *, *, 日本人, ニッポンジン, ニッポンジン],
[助動詞, *, *, *, 特殊・デス, 基本形, です, デス, デス],
[記号, 句点, *, *, *, *, 。, 。, 。]
Notice: ワタシ, ニッポンジン
I think the results from using ipadic are clearly better.
While I really appreciate your previous answer (and creating the wiki), could I ask you to elaborate a bit more on what the disadvantages/advantages of the different dictionaries are?
I thought
Accuracy of results: ipadic < unidic < neologd
Size / speed: neologd < unidic < ipadic
But that seems to not really hold.
> I thought
> Accuracy of results: ipadic < unidic < neologd
> Size / speed: neologd < unidic < ipadic
> But that seems to not really hold.
As you point out, the size of a dictionary is related to its speed (the larger, the slower), but size is not proportional to accuracy.
In my personal experience, I believe that they can be classified as follows:
- Size: ipadic < unidic < neologd
- Speed: neologd < unidic < ipadic
- Accuracy:
  - grammar analysis: unidic < ipadic < neologd
  - word split by proper noun: ipadic < unidic < neologd
  - word split by general-purpose: neologd < ipadic < unidic
This is because each dictionary is created for a different purpose and requires different precision.
> what the disadvantages/advantages of the different dictionaries are?
tl; dr
In summary, IPADIC is typically used for grammatical analysis and UNIDIC for retrieval analysis. IPADIC is lightweight and accurate in most use cases and UNIDIC is good for word-splitting for word search purposes.
IPADIC is recommended when part of speech (PoS) is important.
For example, when PoS is used as a feature for analysis, machine learning, and so on. NEologd is essentially IPADIC plus a user dictionary: the community has extended it to cover new vocabulary missing from IPADIC. However, it is huge.
UNIDIC, on the other hand, is recommended when a sentence needs to be split into smaller units for retrieval, as in search engines.
Typical cases are a search engine that measures the distance between the split units (Levenshtein distance or cosine similarity, for example), or a pipeline that uses each unit's ID (a word ID or token ID, say) as a discrete feature value for machine learning.
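As a rough illustration of measuring similarity between split units, here is a minimal Go sketch of cosine similarity over bags of tokens. It uses only the standard library; the token lists are hard-coded stand-ins for tokenizer output, and the function names `termFreq` and `cosine` are hypothetical, not part of kagome.

```go
package main

import (
	"fmt"
	"math"
)

// termFreq counts how often each token occurs in a token list.
func termFreq(tokens []string) map[string]float64 {
	tf := make(map[string]float64, len(tokens))
	for _, t := range tokens {
		tf[t]++
	}
	return tf
}

// cosine returns the cosine similarity of two bag-of-tokens vectors.
// It returns 0 when either list is empty.
func cosine(a, b []string) float64 {
	fa, fb := termFreq(a), termFreq(b)
	var dot, na, nb float64
	for t, x := range fa {
		dot += x * fb[t] // a missing key yields 0
		na += x * x
	}
	for _, y := range fb {
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	// Hypothetical inputs: a segmented document and a segmented query.
	doc := []string{"りんご", "ジュース", "を", "飲ん", "だ", "。"}
	query := []string{"りんご", "ジュース"}
	fmt.Printf("%.3f\n", cosine(doc, query))
}
```

The score depends entirely on how the text was split: if the document had been segmented as "りん" + "ご", the query token "りんご" would contribute nothing to the dot product.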
It depends on what you are analyzing and how, but in my opinion, I would recommend using IPADIC plus a home-made user dictionary.
ts; dr
Disadvantage of UNIDIC
As you may have already noticed, UNIDIC differs from IPADIC in both segmentation accuracy and speed, which can feel uncomfortable. Compared to IPADIC, UNIDIC seems to be less accurate despite its larger amount of information (larger dictionary size).
$ # IPA DICT
$ time echo "私は日本人です。" | kagome -sysdict ipa
私 名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
日本人 名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。 記号,句点,*,*,*,*,。,。,。
EOS
real 0m1.021s
user 0m1.114s
sys 0m0.090s
$ # UNI DICT
$ time echo "私は日本人です。" | kagome -sysdict uni
私 代名詞,*,*,*,*,*,ワタクシ,私-代名詞,私,ワタクシ,私,ワタクシ,和,*,*,*,*
は 助詞,係助詞,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*
日本 名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
人 接尾辞,名詞的,一般,*,*,*,ニン,人,人,ニン,人,ニン,漢,*,*,*,*
です 助動詞,*,*,*,助動詞-デス,終止形-一般,デス,です,です,デス,です,デス,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
real 0m4.807s
user 0m5.303s
sys 0m0.273s
The problem here is the difference between "日本人" and "日本, 人".
UNIDIC is a dictionary based on "short units" (短単位) defined by the NINJAL to facilitate the collection of examples for the BCCWJ.
- NINJAL (National Institute of Japanese Language and Linguistics)
- BCCWJ (Balanced Corpus of Contemporary Written Japanese)
These "short units" are known to be too short for the syntactic and semantic analysis done in natural language processing.
Thus, in most use cases, IPADIC is faster and more convenient. This is why my recommendation is to use IPADIC with a custom user dictionary.
Advantage and use cases of UNIDIC
An advantage of UNIDIC is the "consistency" in word segmentation.
The difference between the two dictionaries, IPA and UNI, is illustrated by a well-known example.
"りんごジュースを飲んだ。" vs "リンゴジュースを飲んだ。"
Both are correct and mean the same thing: "I drank apple juice".
But, intuitively, "りんごジュース" is easier to read than "リンゴジュース" because the words are visually separated (hiragana-katakana mixture vs all katakana).
And both dictionaries include the words "りんご" and "リンゴ" as a noun (名詞).
$ # IPA DICT
$ echo "りんご" | kagome -sysdict ipa
りんご 名詞,一般,*,*,*,*,りんご,リンゴ,リンゴ
EOS
$ echo "リンゴ" | kagome -sysdict ipa
リンゴ 名詞,一般,*,*,*,*,リンゴ,リンゴ,リンゴ
EOS
$ # UNI DICT
$ echo "りんご" | kagome -sysdict uni
りんご 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*
EOS
$ echo "リンゴ" | kagome -sysdict uni
リンゴ 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*
EOS
And here comes the problem.
$ # IPA DICT
$ echo "りんごジュースを飲んだ。" | kagome -sysdict ipa
りん 副詞,助詞類接続,*,*,*,*,りん,リン,リン
ご 接頭詞,名詞接続,*,*,*,*,ご,ゴ,ゴ
ジュース 名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん 動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ 助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。 記号,句点,*,*,*,*,。,。,。
EOS
$ # UNI DICT
$ echo "りんごジュースを飲んだ。" | kagome -sysdict uni
りんご 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*
ジュース 名詞,普通名詞,一般,*,*,*,ジュース,ジュース-juice,ジュース,ジュース,ジュース,ジュース,外,*,*,*,*
を 助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
飲ん 動詞,一般,*,*,五段-マ行,連用形-撥音便,ノム,飲む,飲ん,ノン,飲む,ノム,和,*,*,*,*
だ 助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,だ,ダ,だ,ダ,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
Note the difference between "りん, ご" and "りんご".
IPADIC recognized "りんご" as an adverb/prefix (副詞/接頭詞) combination and UNIDIC as a noun (名詞).
The simplest solution, apart from registering a user dictionary, is to use katakana notation.
$ # IPADICT
$ echo "リンゴジュースを飲んだ。" | kagome -sysdict ipa
リンゴ 名詞,一般,*,*,*,*,リンゴ,リンゴ,リンゴ
ジュース 名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん 動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ 助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。 記号,句点,*,*,*,*,。,。,。
EOS
$ # UNIDICT
$ echo "リンゴジュースを飲んだ。" | kagome -sysdict uni
リンゴ 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*
ジュース 名詞,普通名詞,一般,*,*,*,ジュース,ジュース-juice,ジュース,ジュース,ジュース,ジュース,外,*,*,*,*
を 助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
飲ん 動詞,一般,*,*,五段-マ行,連用形-撥音便,ノム,飲む,飲ん,ノン,飲む,ノム,和,*,*,*,*
だ 助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,だ,ダ,だ,ダ,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
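The katakana workaround can be applied before tokenizing. Below is a minimal Go sketch, standard library only, that converts a known list of problem words from hiragana to katakana and leaves the rest of the sentence alone. The helper names `toKatakana` and `normalizeWords` and the word list are hypothetical; the conversion relies on the Unicode layout in which hiragana U+3041..U+3096 sit 0x60 code points below their katakana counterparts.

```go
package main

import (
	"fmt"
	"strings"
)

// toKatakana maps each hiragana rune (U+3041..U+3096) to the katakana rune
// 0x60 code points above it and leaves every other rune unchanged.
func toKatakana(s string) string {
	return strings.Map(func(r rune) rune {
		if r >= 0x3041 && r <= 0x3096 {
			return r + 0x60
		}
		return r
	}, s)
}

// normalizeWords rewrites only the listed words, so particles and verb
// endings elsewhere in the text stay in hiragana.
func normalizeWords(text string, words []string) string {
	for _, w := range words {
		text = strings.ReplaceAll(text, w, toKatakana(w))
	}
	return text
}

func main() {
	// The word list is an assumption; in practice it would hold the words
	// that the chosen dictionary is known to split badly.
	fmt.Println(normalizeWords("りんごジュースを飲んだ。", []string{"りんご"}))
}
```

Because this is plain substring replacement, it will also rewrite a listed word when it happens to occur inside a longer word, so a user dictionary remains the more precise fix.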
The difference is that IPADIC attempted to interpret them grammatically, while UNIDIC interpreted them in short units.
- "日本人" (noun) vs "日本, 人" (noun + suffix)
- "りん, ご, ジュース" (adverb + prefix + noun) vs "りんご, ジュース" (noun + noun)
In both cases, the latter segmentation yields units suitable for search engines, etc.
This means that "short units" are effective in unifying the units of "search examples" in search engines and other information retrieval systems.
Thus, UNIDIC has the advantage for word-search purposes.
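The effect on search can be sketched with a tiny inverted index. This Go example uses only the standard library and hard-codes the segmentations copied from the outputs above; `buildIndex`, `ipaDocs`, and `uniDocs` are hypothetical names, and a real search engine would do far more than exact token lookup.

```go
package main

import "fmt"

// buildIndex maps each token to the IDs of the documents that contain it.
func buildIndex(docs [][]string) map[string][]int {
	index := make(map[string][]int)
	for id, tokens := range docs {
		seen := make(map[string]bool)
		for _, t := range tokens {
			if !seen[t] {
				seen[t] = true
				index[t] = append(index[t], id)
			}
		}
	}
	return index
}

// Doc 0 is "りんごジュースを飲んだ。" and doc 1 is "リンゴジュースを飲んだ。",
// segmented as in the IPADIC and UNIDIC outputs shown in this thread.
var ipaDocs = [][]string{
	{"りん", "ご", "ジュース", "を", "飲ん", "だ", "。"},
	{"リンゴ", "ジュース", "を", "飲ん", "だ", "。"},
}

var uniDocs = [][]string{
	{"りんご", "ジュース", "を", "飲ん", "だ", "。"},
	{"リンゴ", "ジュース", "を", "飲ん", "だ", "。"},
}

func main() {
	// Look up the query token "りんご" in each index.
	fmt.Println("ipa:", buildIndex(ipaDocs)["りんご"])
	fmt.Println("uni:", buildIndex(uniDocs)["りんご"])
}
```

With the IPADIC segmentation the token "りんご" never enters the index, so the lookup finds nothing; with the UNIDIC segmentation it finds doc 0.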
Are you convinced by this explanation? > @CaptainDario
Am I on the right track in my explanation? > @ikawaha
Let me know so I can fix it and add it to the Wiki.
@KEINOS well first of all thank you for this very detailed explanation. It really helped me a lot!
I think this should definitely be added to the wiki, for starters this is gold.
In your opinion, is neologd worth it over standard ipadic for Japanese NLP?
> It really helped me a lot!
> I think this should definitely be added to the wiki, for starters this is gold.
I'm glad to hear that! So far so good. 👍
> In your opinion, is neologd worth it over standard ipadic for Japanese NLP?
Neologd is a great dictionary. However, for my current usage, I choose IPADIC. If speed is not important, it is worth using Neologd, which is just an extension of IPADIC.
Actually, there is a Japanese text linter implemented in JavaScript, but because of speed issues and the need to install Node.js separately, I have been quietly struggling to reimplement it in Go with Kagome.
However, the dictionary lookup part seems to be the bottleneck: even a simple test implementation using Neologd is not as fast as the original Textlint. So I'm currently losing motivation to build a text linter in Go.
I wish I could help speed up Kagome, but I just started learning Go in earnest after this Corona disaster thing, so I can't keep up with its technology yet. 😭 Documenting is the only thing I can contribute for now.
@CaptainDario (cc: @ikawaha )
FYI, I added the FAQ and a document about it to the wiki. Feel free to fix them!
JFYI. I added the below article to the Wiki.
- Kagome As a Server Side Tokenizer | Wiki | kagome @ GitHub