Incomplete Japanese parsing when compared to MorphMan
Describe the bug
There is a large difference in how `anki-morphs` parses Japanese sentences compared to `MorphMan`. I am using the recommended `ankimorphs-japanese-mecab` add-on for `anki-morphs`, and I used the `MeCab UniDic Japanese Dictionary` add-on with `MorphMan`. Although both use `MeCab`, `anki-morphs` misses a large number of words when viewing the morphemes for a card that `MorphMan` parses correctly. Strangely, some of the missing morphemes show up in the priority files. The examples provided aren't exhaustive.
Steps to reproduce the behavior
- Have `AnkiMorphs: Japanese` as the morphemizer for the card type
- Have cards with sentences such as:
- キノはワイルドライススープを,
- ネコザメだ
- C-5Mスーパーギャラクシーです
- 敵影なし
- 村人を一人一人作るんだよ
- Right-click the card
- Click `View Morphemes`
- See that the morphemes are as follows:
- キノ, は, を (priority output includes ワイルド, ライス, and スープ)
- だ (priority output includes スーパー)
- です
- 敵, 影, なし
- 村人, を, 一, 人, 作る, ん, だ, よ (priority output includes 一人ひとり but not 一人一人)
Expected behavior
The morphemes for the above cards should be as follows:
- キノ, は, ワイルド, ライス, スープ, を
- ネコザメ, だ
- スーパー, ギャラクシー, です
- 敵影, なし
- 村人, を, 一人一人, 作る, ん, だ, よ
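The gap between the observed and expected lists above can be made explicit with a small script. This is only a sketch: the two dictionaries are copied by hand from this report, and `diff_morphs` is a hypothetical helper, not part of AnkiMorphs or MorphMan.

```python
# Morphs shown by "View Morphemes" (observed) and the morphs the reporter
# expected, both copied by hand from the lists in this issue.
observed = {
    "キノはワイルドライススープを,": ["キノ", "は", "を"],
    "ネコザメだ": ["だ"],
    "C-5Mスーパーギャラクシーです": ["です"],
    "敵影なし": ["敵", "影", "なし"],
    "村人を一人一人作るんだよ": ["村人", "を", "一", "人", "作る", "ん", "だ", "よ"],
}
expected = {
    "キノはワイルドライススープを,": ["キノ", "は", "ワイルド", "ライス", "スープ", "を"],
    "ネコザメだ": ["ネコザメ", "だ"],
    "C-5Mスーパーギャラクシーです": ["スーパー", "ギャラクシー", "です"],
    "敵影なし": ["敵影", "なし"],
    "村人を一人一人作るんだよ": ["村人", "を", "一人一人", "作る", "ん", "だ", "よ"],
}


def diff_morphs(expected_morphs, observed_morphs):
    """Hypothetical helper: return (missing, extra) morphs, preserving order."""
    missing = [m for m in expected_morphs if m not in observed_morphs]
    extra = [m for m in observed_morphs if m not in expected_morphs]
    return missing, extra


for sentence in expected:
    missing, extra = diff_morphs(expected[sentence], observed[sentence])
    print(f"{sentence}: missing={missing} extra={extra}")
```

Two patterns show up in the output: some words are dropped entirely (missing morphs, nothing extra), while others are split into smaller pieces (missing morphs plus extra fragments such as 敵 and 影).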
My AnkiMorphs settings
{
"algorithm_all_morphs_target_difference_weight": 10,
"algorithm_average_priority_all_morphs_weight": 0,
"algorithm_average_priority_learning_morphs_weight": 0,
"algorithm_learning_morphs_target_difference_weight": 10,
"algorithm_lower_target_all_morphs": 4,
"algorithm_lower_target_all_morphs_coefficient_a": 0.0,
"algorithm_lower_target_all_morphs_coefficient_b": 1.0,
"algorithm_lower_target_all_morphs_coefficient_c": 0.0,
"algorithm_lower_target_learning_morphs": 1,
"algorithm_lower_target_learning_morphs_coefficient_a": 6.0,
"algorithm_lower_target_learning_morphs_coefficient_b": 0.0,
"algorithm_lower_target_learning_morphs_coefficient_c": 0.0,
"algorithm_total_priority_all_morphs_weight": 1,
"algorithm_total_priority_learning_morphs_weight": 10,
"algorithm_total_priority_unknown_morphs_weight": 10,
"algorithm_upper_target_all_morphs": 6,
"algorithm_upper_target_all_morphs_coefficient_a": 1.0,
"algorithm_upper_target_all_morphs_coefficient_b": 0.0,
"algorithm_upper_target_all_morphs_coefficient_c": 0.0,
"algorithm_upper_target_learning_morphs": 2,
"algorithm_upper_target_learning_morphs_coefficient_a": 6.0,
"algorithm_upper_target_learning_morphs_coefficient_b": 0.0,
"algorithm_upper_target_learning_morphs_coefficient_c": 0.0,
"evaluate_morph_inflection": true,
"evaluate_morph_lemma": false,
"extra_fields_display_inflections": true,
"extra_fields_display_lemmas": false,
"filters": [
{
"extra_all_morphs": false,
"extra_all_morphs_count": false,
"extra_highlighted": true,
"extra_score": false,
"extra_score_terms": false,
"extra_study_morphs": true,
"extra_unknown_morphs": true,
"extra_unknown_morphs_count": true,
"field": "Expression",
"modify": true,
"morph_priority_selection": "books.csv",
"morphemizer_description": "AnkiMorphs: Japanese",
"note_type": "Targeted Sentence",
"read": true,
"tags": {
"exclude": [],
"include": []
}
},
{
"extra_all_morphs": false,
"extra_all_morphs_count": false,
"extra_highlighted": false,
"extra_score": false,
"extra_score_terms": false,
"extra_study_morphs": false,
"extra_unknown_morphs": false,
"extra_unknown_morphs_count": false,
"field": "Target",
"modify": false,
"morph_priority_selection": "books.csv",
"morphemizer_description": "AnkiMorphs: Japanese",
"note_type": "Japanese-75658",
"read": true,
"tags": {
"exclude": [],
"include": []
}
}
],
"hide_inflection_toolbar": false,
"hide_lemma_toolbar": false,
"hide_recalc_toolbar": false,
"interval_for_known_morphs": 21,
"preprocess_custom_characters_to_ignore": "",
"preprocess_ignore_bracket_contents": true,
"preprocess_ignore_custom_characters": false,
"preprocess_ignore_names_morphemizer": false,
"preprocess_ignore_names_textfile": false,
"preprocess_ignore_round_bracket_contents": true,
"preprocess_ignore_slim_round_bracket_contents": false,
"preprocess_ignore_suspended_cards_content": true,
"read_known_morphs_folder": false,
"recalc_due_offset": 500000,
"recalc_move_known_new_cards_to_the_end": false,
"recalc_number_of_morphs_to_offset": 100,
"recalc_offset_new_cards": false,
"recalc_on_sync": false,
"recalc_suspend_known_new_cards": false,
"shortcut_browse_all_same_unknown": "Shift+L",
"shortcut_browse_ready_same_unknown": "L",
"shortcut_browse_ready_same_unknown_lemma": "Ctrl+Shift+L",
"shortcut_generators": "Ctrl+Shift+G",
"shortcut_known_morphs_exporter": "Ctrl+Shift+E",
"shortcut_learn_now": "Ctrl+Alt+N",
"shortcut_progression": "Ctrl+Alt+P",
"shortcut_recalc": "Ctrl+Shift+R",
"shortcut_set_known_and_skip": "K",
"shortcut_settings": "Ctrl+Shift+S",
"shortcut_view_morphemes": "Ctrl+Alt+V",
"skip_only_known_morphs_cards": true,
"skip_show_num_of_skipped_cards": true,
"skip_unknown_morph_seen_today_cards": true,
"tag_fresh": "am-fresh-morphs",
"tag_known_automatically": "am-known-automatically",
"tag_known_manually": "am-known-manually",
"tag_learn_card_now": "am-learn-card-now",
"tag_not_ready": "am-not-ready",
"tag_ready": "am-ready",
"toolbar_stats_use_known": false,
"toolbar_stats_use_seen": true
}
My system
- Operating System: Manjaro Linux
- Anki Version: 24.06.3 (d678e393)
- AnkiMorphs Version: 3.2.0
Additional context
Below is how MorphMan parses those sentences:
キノはワイルドライススープを,
0 キノ-kino キノ キノ キノ 名詞 普通名詞
0 は は は ハ 助詞 係助詞
0 ワイルド-wild ワイルド ワイルド ワイルド 形状詞 一般
0 ライス-rice ライス ライス ライス 名詞 普通名詞
0 スープ-soup スープ スープ スープ 名詞 普通名詞
0 を を を ヲ 助詞 格助詞
ネコザメだ
0 猫鮫 ネコザメ ネコザメ ネコザメ 名詞 普通名詞
0 だ だ だ ダ 助動詞 *
C-5Mスーパーギャラクシーです
0 スーパー-super スーパー スーパー スーパー 名詞 普通名詞
0 ギャラクシー-galaxy ギャラクシー ギャラクシー ギャラクシー 名詞 普通名詞
0 です です です デス 助動詞
敵影なし
0 敵影 敵影 敵影 テキエイ 名詞 普通名詞
0 無し なし なし ナシ 名詞 普通名詞
村人を一人一人作るんだよ
0 村人 村人 村人 ムラビト 名詞 普通名詞
0 を を を ヲ 助詞 格助詞
0 一人一人 一人一人 一人一人 ヒトリヒトリ 名詞 普通名詞
0 作る 作る 作る ツクル 動詞 一般
0 の ん ん ノ 助詞 準体助詞
0 だ だ だ ダ 助動詞 *
0 よ よ よ ヨ 助詞 終助詞
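For anyone who wants to compare the MorphMan output above programmatically, the rows can be split into fields. This is a rough sketch: the column names (`norm`, `base`, `inflected`, `reading`, `pos`, `sub_pos`) are my assumption about what each column holds, the meaning of the leading `0` is not stated here, and `DumpRow` and `parse_row` are hypothetical names.

```python
from typing import NamedTuple, Optional


class DumpRow(NamedTuple):
    """One row of the dump above; the field names are assumptions."""
    leading: str            # the leading "0"; its meaning is not stated in this issue
    norm: str
    base: str
    inflected: str
    reading: str
    pos: str
    sub_pos: Optional[str]  # absent on some rows, e.g. the です row


def parse_row(line: str) -> DumpRow:
    """Split one whitespace-separated dump row into named fields."""
    parts = line.split()
    if len(parts) < 6:
        raise ValueError(f"expected at least 6 columns, got {len(parts)}: {line!r}")
    sub_pos = parts[6] if len(parts) > 6 else None
    return DumpRow(*parts[:6], sub_pos)


# Rows for ネコザメだ, copied from the output above.
dump = """\
0 猫鮫 ネコザメ ネコザメ ネコザメ 名詞 普通名詞
0 だ だ だ ダ 助動詞 *
"""

rows = [parse_row(line) for line in dump.splitlines() if line.strip()]
inflected = [row.inflected for row in rows]
print(inflected)
```

The `inflected` column is used here because the settings above have `evaluate_morph_inflection` enabled; switching to `row.base` or `row.norm` would be the equivalent for a lemma-based comparison.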
The `ankimorphs-japanese-mecab` add-on uses a non-unidic version of mecab iirc, which is generally better at preserving words due to its less aggressive splitting, although it's not without its own flaws.
The unidic version is almost indistinguishable from the Japanese spaCy morphemizers imo, so when support for spaCy was added, I removed support for the unidic version and massively improved the codebase as a result.
Here is a comparison between them:
Sentence | (non-unidic) MeCab Morphs | spaCy (ja-core-news-sm) Morphs |
---|---|---|
キノはワイルドライススープを, | は, を, キノ | は, を, キノ, スープ, ライス, ワイルド |
C-5Mスーパーギャラクシーです | です | -, 5, c, m, です, ギャラクシー, スーパー |
ネコザメだ | だ | だ, ネコザメ |
敵影なし | なし, 影, 敵 | なし, 敵影 |
村人を一人一人作るんだよ | だ, よ, を, ん, 一, 人, 作る, 村人 | だ, よ, を, ん, 一人一人, 作る, 村人 |
I haven't saved a list of examples where the non-unidic mecab morphemizer produces better results, but that list is definitely long, so the morphemizer choice basically comes down to your preferences and what you can tolerate; one is not always superior to the other.
If you haven't tried spacy yet, give it a go and see if it gives you better results. However, I won’t be adding support for unidic mecab, sorry about that 🙏