Incomplete Japanese parsing when compared to MorphMan

Question

Closed this issue 2 months ago · 2 comments

Describe the bug

There is a large difference in how anki-morphs parses Japanese sentences compared to MorphMan. I am using the recommended ankimorphs-japanese-mecab add-on with anki-morphs, and I used the MeCab UniDic Japanese Dictionary add-on with MorphMan. Although both use MeCab, anki-morphs misses a large number of words that MorphMan parses correctly when I view the morphemes for a card. Strangely, some of the missing morphemes do show up in priority files. The examples provided aren't exhaustive.

Steps to reproduce the behavior

Have AnkiMorphs: Japanese as the morphemizer for the card type
Have cards with sentences such as:

キノはワイルドライススープを,
ネコザメだ
C-5Mスーパーギャラクシーです
敵影なし
村人を一人一人作るんだよ

Right click card
Click View Morphemes
See the morphemes are as follows:

キノ, は, を (priority output includes ワイルド, ライス, and スープ)
だ (priority output includes スーパー)
です
敵, 影, なし
村人, を, 一, 人, 作る, ん, だ, よ (priority output includes 一人ひとり but not 一人一人)

Expected behavior

The morphemes for the above cards should be as follows:

キノ, は, ワイルド, ライス, スープ, を
ネコザメ, だ
スーパー, ギャラクシー, です
敵影, なし
村人, を, 一人一人, 作る, ん, だ, よ
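
To make the gap explicit, here is a minimal sketch that diffs the two lists above. The morphemes are copied from this report; the names OBSERVED, EXPECTED, and missing_morphs are only illustrative and are not part of AnkiMorphs or MorphMan.

```python
# Morphemes copied from this report, keyed by sentence:
# OBSERVED is what "View Morphemes" showed, EXPECTED is what was expected.
OBSERVED = {
    "キノはワイルドライススープを,": ["キノ", "は", "を"],
    "ネコザメだ": ["だ"],
    "C-5Mスーパーギャラクシーです": ["です"],
    "敵影なし": ["敵", "影", "なし"],
    "村人を一人一人作るんだよ": ["村人", "を", "一", "人", "作る", "ん", "だ", "よ"],
}

EXPECTED = {
    "キノはワイルドライススープを,": ["キノ", "は", "ワイルド", "ライス", "スープ", "を"],
    "ネコザメだ": ["ネコザメ", "だ"],
    "C-5Mスーパーギャラクシーです": ["スーパー", "ギャラクシー", "です"],
    "敵影なし": ["敵影", "なし"],
    "村人を一人一人作るんだよ": ["村人", "を", "一人一人", "作る", "ん", "だ", "よ"],
}


def missing_morphs(observed, expected):
    """Return the expected morphs that are absent from the observed list, in order."""
    seen = set(observed)
    return [morph for morph in expected if morph not in seen]


# Per-sentence list of morphs that were expected but not produced.
MISSING = {
    sentence: missing_morphs(OBSERVED[sentence], expected)
    for sentence, expected in EXPECTED.items()
}

for sentence, morphs in MISSING.items():
    print(sentence, "->", ", ".join(morphs))
```

Every sentence ends up with at least one missing morph, and most of the missing ones are katakana words or kanji compounds.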

My AnkiMorphs settings

{
    "algorithm_all_morphs_target_difference_weight": 10,
    "algorithm_average_priority_all_morphs_weight": 0,
    "algorithm_average_priority_learning_morphs_weight": 0,
    "algorithm_learning_morphs_target_difference_weight": 10,
    "algorithm_lower_target_all_morphs": 4,
    "algorithm_lower_target_all_morphs_coefficient_a": 0.0,
    "algorithm_lower_target_all_morphs_coefficient_b": 1.0,
    "algorithm_lower_target_all_morphs_coefficient_c": 0.0,
    "algorithm_lower_target_learning_morphs": 1,
    "algorithm_lower_target_learning_morphs_coefficient_a": 6.0,
    "algorithm_lower_target_learning_morphs_coefficient_b": 0.0,
    "algorithm_lower_target_learning_morphs_coefficient_c": 0.0,
    "algorithm_total_priority_all_morphs_weight": 1,
    "algorithm_total_priority_learning_morphs_weight": 10,
    "algorithm_total_priority_unknown_morphs_weight": 10,
    "algorithm_upper_target_all_morphs": 6,
    "algorithm_upper_target_all_morphs_coefficient_a": 1.0,
    "algorithm_upper_target_all_morphs_coefficient_b": 0.0,
    "algorithm_upper_target_all_morphs_coefficient_c": 0.0,
    "algorithm_upper_target_learning_morphs": 2,
    "algorithm_upper_target_learning_morphs_coefficient_a": 6.0,
    "algorithm_upper_target_learning_morphs_coefficient_b": 0.0,
    "algorithm_upper_target_learning_morphs_coefficient_c": 0.0,
    "evaluate_morph_inflection": true,
    "evaluate_morph_lemma": false,
    "extra_fields_display_inflections": true,
    "extra_fields_display_lemmas": false,
    "filters": [
        {
            "extra_all_morphs": false,
            "extra_all_morphs_count": false,
            "extra_highlighted": true,
            "extra_score": false,
            "extra_score_terms": false,
            "extra_study_morphs": true,
            "extra_unknown_morphs": true,
            "extra_unknown_morphs_count": true,
            "field": "Expression",
            "modify": true,
            "morph_priority_selection": "books.csv",
            "morphemizer_description": "AnkiMorphs: Japanese",
            "note_type": "Targeted Sentence",
            "read": true,
            "tags": {
                "exclude": [],
                "include": []
            }
        },
        {
            "extra_all_morphs": false,
            "extra_all_morphs_count": false,
            "extra_highlighted": false,
            "extra_score": false,
            "extra_score_terms": false,
            "extra_study_morphs": false,
            "extra_unknown_morphs": false,
            "extra_unknown_morphs_count": false,
            "field": "Target",
            "modify": false,
            "morph_priority_selection": "books.csv",
            "morphemizer_description": "AnkiMorphs: Japanese",
            "note_type": "Japanese-75658",
            "read": true,
            "tags": {
                "exclude": [],
                "include": []
            }
        }
    ],
    "hide_inflection_toolbar": false,
    "hide_lemma_toolbar": false,
    "hide_recalc_toolbar": false,
    "interval_for_known_morphs": 21,
    "preprocess_custom_characters_to_ignore": "",
    "preprocess_ignore_bracket_contents": true,
    "preprocess_ignore_custom_characters": false,
    "preprocess_ignore_names_morphemizer": false,
    "preprocess_ignore_names_textfile": false,
    "preprocess_ignore_round_bracket_contents": true,
    "preprocess_ignore_slim_round_bracket_contents": false,
    "preprocess_ignore_suspended_cards_content": true,
    "read_known_morphs_folder": false,
    "recalc_due_offset": 500000,
    "recalc_move_known_new_cards_to_the_end": false,
    "recalc_number_of_morphs_to_offset": 100,
    "recalc_offset_new_cards": false,
    "recalc_on_sync": false,
    "recalc_suspend_known_new_cards": false,
    "shortcut_browse_all_same_unknown": "Shift+L",
    "shortcut_browse_ready_same_unknown": "L",
    "shortcut_browse_ready_same_unknown_lemma": "Ctrl+Shift+L",
    "shortcut_generators": "Ctrl+Shift+G",
    "shortcut_known_morphs_exporter": "Ctrl+Shift+E",
    "shortcut_learn_now": "Ctrl+Alt+N",
    "shortcut_progression": "Ctrl+Alt+P",
    "shortcut_recalc": "Ctrl+Shift+R",
    "shortcut_set_known_and_skip": "K",
    "shortcut_settings": "Ctrl+Shift+S",
    "shortcut_view_morphemes": "Ctrl+Alt+V",
    "skip_only_known_morphs_cards": true,
    "skip_show_num_of_skipped_cards": true,
    "skip_unknown_morph_seen_today_cards": true,
    "tag_fresh": "am-fresh-morphs",
    "tag_known_automatically": "am-known-automatically",
    "tag_known_manually": "am-known-manually",
    "tag_learn_card_now": "am-learn-card-now",
    "tag_not_ready": "am-not-ready",
    "tag_ready": "am-ready",
    "toolbar_stats_use_known": false,
    "toolbar_stats_use_seen": true
}
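
Most of the settings above concern scoring and shortcuts. As a rough sketch, the snippet below pulls out the entries that seem most likely to affect parsing: the morphemizer chosen per note type and the preprocessing flags that are switched on. The excerpt is trimmed from the JSON above; which settings actually matter for tokenization is my assumption.

```python
import json

# Trimmed excerpt of the settings above (values copied unchanged).
SETTINGS_EXCERPT = """
{
    "filters": [
        {"note_type": "Targeted Sentence", "field": "Expression",
         "morphemizer_description": "AnkiMorphs: Japanese"},
        {"note_type": "Japanese-75658", "field": "Target",
         "morphemizer_description": "AnkiMorphs: Japanese"}
    ],
    "preprocess_ignore_bracket_contents": true,
    "preprocess_ignore_custom_characters": false,
    "preprocess_ignore_names_morphemizer": false,
    "preprocess_ignore_round_bracket_contents": true,
    "preprocess_ignore_slim_round_bracket_contents": false
}
"""

settings = json.loads(SETTINGS_EXCERPT)

# Which morphemizer each note type is configured to use.
morphemizers = {
    f["note_type"]: f["morphemizer_description"] for f in settings["filters"]
}

# Preprocessing options that are enabled (boolean true only).
enabled_preprocessing = sorted(
    key
    for key, value in settings.items()
    if key.startswith("preprocess_") and value is True
)

print(morphemizers)
print(enabled_preprocessing)
```

Both note types use the same morphemizer, and only the two bracket-related preprocessing options are enabled.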

My system

Operating System: Manjaro Linux
Anki Version: 24.06.3 (d678e393)
AnkiMorphs Version: 3.2.0

Additional context

Below is how MorphMan parses those sentences:

キノはワイルドライススープを,
0	キノ-kino	キノ	キノ	キノ	名詞	普通名詞
0	は	は	は	ハ	助詞	係助詞
0	ワイルド-wild	ワイルド	ワイルド	ワイルド	形状詞	一般
0	ライス-rice	ライス	ライス	ライス	名詞	普通名詞
0	スープ-soup	スープ	スープ	スープ	名詞	普通名詞
0	を	を	を	ヲ	助詞	格助詞

ネコザメだ
0	猫鮫	ネコザメ	ネコザメ	ネコザメ	名詞	普通名詞
0	だ	だ	だ	ダ	助動詞	*

C-5Mスーパーギャラクシーです
0	スーパー-super	スーパー	スーパー	スーパー	名詞	普通名詞
0	ギャラクシー-galaxy	ギャラクシー	ギャラクシー	ギャラクシー	名詞	普通名詞
0	です	です	です	デス	助動詞

敵影なし
0	敵影	敵影	敵影	テキエイ	名詞	普通名詞
0	無し	なし	なし	ナシ	名詞	普通名詞

村人を一人一人作るんだよ
0	村人	村人	村人	ムラビト	名詞	普通名詞
0	を	を	を	ヲ	助詞	格助詞
0	一人一人	一人一人	一人一人	ヒトリヒトリ	名詞	普通名詞
0	作る	作る	作る	ツクル	動詞	一般
0	の	ん	ん	ノ	助詞	準体助詞
0	だ	だ	だ	ダ	助動詞	*
0	よ	よ	よ	ヨ	助詞	終助詞

Answer 1 · 2024-10-04T10:49:02.000Z

The ankimorphs-japanese-mecab add-on uses a non-unidic version of mecab iirc, which is generally better at preserving words due to its less aggressive splitting, although it's not without its own flaws.

The unidic version is almost indistinguishable from the japanese spaCy morphemizers imo, so when support for spacy was added, I removed support for the unidic version and massively improved the codebase as a result.

Here is a comparison between them:

Sentence	(non-unidic) MeCab Morphs	spaCy (ja-core-news-sm) Morphs
キノはワイルドライススープを,	は, を, キノ	は, を, キノ, スープ, ライス, ワイルド
C-5Mスーパーギャラクシーです	です	-, 5, c, m, です, ギャラクシー, スーパー
ネコザメだ	だ	だ, ネコザメ
敵影なし	なし, 影, 敵	なし, 敵影
村人を一人一人作るんだよ	だ, よ, を, ん, 一, 人, 作る, 村人	だ, よ, を, ん, 一人一人, 作る, 村人

I haven't saved a list of examples where the non-unidic mecab morphemizer produces better results, but that list is definitely long, so the morphemizer choice basically comes down to preference and which kinds of errors you can tolerate. Neither one is always superior to the other.

If you haven't tried spacy yet, give it a go and see if it gives you better results. However, I won’t be adding support for unidic mecab, sorry about that 🙏
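
For anyone who wants to check how spaCy splits these sentences before switching morphemizers, here is a minimal sketch that runs outside Anki. It assumes spaCy with Japanese support and the ja_core_news_sm model are installed in a regular Python environment. The helper names (load_spacy_tokenizer, tokenize_all, main) are made up for this example, and this is not how AnkiMorphs calls spaCy internally.

```python
from typing import Callable, Dict, Iterable, List

Tokenizer = Callable[[str], List[str]]

# Example sentences from this issue.
SENTENCES = [
    "キノはワイルドライススープを,",
    "ネコザメだ",
    "C-5Mスーパーギャラクシーです",
    "敵影なし",
    "村人を一人一人作るんだよ",
]


def load_spacy_tokenizer(model_name: str = "ja_core_news_sm") -> Tokenizer:
    """Build a tokenizer backed by a spaCy pipeline.

    Assumption: spaCy and the named model are installed. The import is lazy
    so the rest of this module can be used without spaCy.
    """
    import spacy

    nlp = spacy.load(model_name)

    def tokenize(text: str) -> List[str]:
        # token.text is the surface form; token.lemma_ would give the lemma.
        return [token.text for token in nlp(text)]

    return tokenize


def tokenize_all(sentences: Iterable[str], tokenize: Tokenizer) -> Dict[str, List[str]]:
    """Apply the given tokenizer to each sentence."""
    return {sentence: tokenize(sentence) for sentence in sentences}


def main() -> None:
    """Print the spaCy tokens for each example sentence."""
    tokenize = load_spacy_tokenizer()
    for sentence, tokens in tokenize_all(SENTENCES, tokenize).items():
        print(sentence, "->", ", ".join(tokens))
```

Calling main() prints the raw spaCy tokens. They may not match the table above exactly: the table shows c and m in lowercase, for example, which suggests the add-on post-processes tokens before listing them.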

Answer 2 · 2024-10-12T02:55:20.000Z

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.