Arabic diacritical marks splitting single words into two separate morphs
Describe the bug
When AnkiMorphs analyzes Arabic sentences that carry diacritical marks, it splits single words into separate morphs. These diacritical characters seem to be treated as "spaces" by the AnkiMorphs: Language w/ Spaces morphemizer.
Steps to reproduce the behavior
(apologies for formatting or other errors; I'm new to this)
I have many sentences in my analyzed fields that have Arabic diacritical markings. For instance, I have the sentence:
أنا بصدّقك وإحنا دايماً منقلكم
"I believe you and we always say to you..."
When dividing this into morphs, AnkiMorphs treats the diacritical characters as word dividers, in this case ّ and ً. In the latter case the mark sits at the end of the word, so it doesn't disrupt too much, but ّ divides its word in half, resulting in the following list of morphs in the am-unknowns field (the now-nonsensical word chunks are بصد and قك):
أنا, بصد, دايما, قك, منقلكم, وإحنا
This is obviously a problem, as now those two word fragments mean nothing.
Here is a more dire example:
تضحك ، وتُغَمِّضُ عينيها ، وتَحْمَرُّ وَجنَتاها
She laughs, and closes her eyes, and her cheeks redden
Which results in the following list, where only two intact words (تضحك and عينيها) remain:
تاها, تضحك, جن, ح, ر, ض, عينيها, غ, م, و, وت
Expected behavior
What I had expected would happen is that words would be divided only by spaces and punctuation. So in my second example, it would have resulted in:
تضحك, وتُغَمِّضُ, عينيها, وتَحْمَرُّ, وَجنَتاها
This is what I was expecting, knowing that I was just using "spaces" and "collection frequency" since there are not (as far as I could find) any morphemizer or lemma resources for Arabic. So I knew that this would be a crude operation in the first place.
Now that I'm writing this, one possible solution would be to "clean" the words of any diacritical marks. That would let words with the same base spelling but different diacritical marks -- or none at all -- be grouped as the same morphs (surface forms in this case, since I don't have a morphological analyzer). That solution would result in the following:
تضحك، وتغمض، عينيها، وتحمر، وجنتاها
This solution would result in some combinations of words that are spelled with the same letters but have different pronunciations and/or meanings, so it may not be desirable for all users in all cases. Probably best to leave such things up to a morpheme/lemma document that may be created in the future.
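For what it's worth, here is a rough sketch of that "cleaning" idea (my own illustration, not AnkiMorphs code), assuming the marks in question are the common Arabic diacritics U+064B through U+0652:

import re

# Hypothetical helper (not part of AnkiMorphs): remove the eight common Arabic
# diacritics (fathatan through sukoon, U+064B-U+0652) so that differently
# vocalized spellings collapse into the same surface form.
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    return ARABIC_DIACRITICS.sub("", text)

print(strip_diacritics("تضحك ، وتُغَمِّضُ عينيها ، وتَحْمَرُّ وَجنَتاها"))
# -> تضحك ، وتغمض عينيها ، وتحمر وجنتاها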
But it seems like, at a minimum, those diacritical marks should not be treated as "spaces" in the first place.
The marks that I would want kept as part of words are listed below (a sketch of the idea follows the list):
shadda: ّ
fatHa: َ
Damma: ُ
kasra: ِ
tanween al-fatH: ً
tanween aD-Damm: ٌ
tanween al-kasr: ٍ
sukoon: ْ
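Conveniently, these eight marks form the contiguous Unicode range U+064B–U+0652, so in principle a space-based tokenizer only needs to treat that range as word characters. A rough sketch of what I mean (illustrative only, not the morphemizer's actual code):

import re

# Hypothetical pattern: extend \w with the eight diacritics listed above
# (U+064B-U+0652) so they no longer act as word boundaries.
WORD_WITH_DIACRITICS = re.compile(r"[\w\u064B-\u0652]+")

print(WORD_WITH_DIACRITICS.findall("تضحك ، وتُغَمِّضُ عينيها ، وتَحْمَرُّ وَجنَتاها"))
# -> ['تضحك', 'وتُغَمِّضُ', 'عينيها', 'وتَحْمَرُّ', 'وَجنَتاها']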
Any help with this is greatly appreciated.
My AnkiMorphs settings
{
"filters": [
{
"extra_highlighted": true,
"extra_score": true,
"extra_unknowns": true,
"extra_unknowns_count": true,
"field": "Back",
"modify": true,
"morph_priority": "Collection frequency",
"morphemizer_description": "AnkiMorphs: Language w/ Spaces",
"note_type": "ArabicArabic-AudioFront",
"read": true,
"tags": {
"exclude": [],
"include": []
}
},
{
"extra_highlighted": true,
"extra_score": true,
"extra_unknowns": true,
"extra_unknowns_count": true,
"field": "Front",
"modify": true,
"morph_priority": "Collection frequency",
"morphemizer_description": "AnkiMorphs: Language w/ Spaces",
"note_type": "ArabicArabicFrontBackSimple",
"read": true,
"tags": {
"exclude": [],
"include": []
}
},
{
"extra_highlighted": true,
"extra_score": true,
"extra_unknowns": true,
"extra_unknowns_count": true,
"field": "Text",
"modify": true,
"morph_priority": "Collection frequency",
"morphemizer_description": "AnkiMorphs: Language w/ Spaces",
"note_type": "Cloze-RTL",
"read": true,
"tags": {
"exclude": [],
"include": []
}
},
{
"extra_highlighted": true,
"extra_score": true,
"extra_unknowns": true,
"extra_unknowns_count": true,
"field": "Main Arabic Entry",
"modify": true,
"morph_priority": "Collection frequency",
"morphemizer_description": "AnkiMorphs: Language w/ Spaces",
"note_type": "Arabic Vocabulary",
"read": true,
"tags": {
"exclude": [],
"include": []
}
},
{
"extra_highlighted": true,
"extra_score": true,
"extra_unknowns": true,
"extra_unknowns_count": true,
"field": "Expression",
"modify": true,
"morph_priority": "Collection frequency",
"morphemizer_description": "AnkiMorphs: Language w/ Spaces",
"note_type": "subs2srs",
"read": true,
"tags": {
"exclude": [],
"include": []
}
}
],
"preprocess_ignore_bracket_contents": false,
"preprocess_ignore_names_morphemizer": false,
"preprocess_ignore_names_textfile": false,
"preprocess_ignore_round_bracket_contents": false,
"preprocess_ignore_slim_round_bracket_contents": false,
"preprocess_ignore_suspended_cards_content": false,
"recalc_due_offset": 500000,
"recalc_interval_for_known": 21,
"recalc_move_known_new_cards_to_the_end": false,
"recalc_number_of_morphs_to_offset": 100,
"recalc_offset_new_cards": false,
"recalc_on_sync": true,
"recalc_read_known_morphs_folder": false,
"recalc_suspend_known_new_cards": false,
"recalc_toolbar_stats_use_known": false,
"recalc_toolbar_stats_use_seen": true,
"recalc_unknowns_field_shows_inflections": true,
"recalc_unknowns_field_shows_lemmas": false,
"shortcut_browse_all_same_unknown": "Shift+L",
"shortcut_browse_ready_same_unknown": "L",
"shortcut_browse_ready_same_unknown_lemma": "Ctrl+Shift+L",
"shortcut_generators": "Ctrl+Shift+G",
"shortcut_known_morphs_exporter": "Ctrl+Shift+E",
"shortcut_learn_now": "Ctrl+Alt+N",
"shortcut_recalc": "Ctrl+M",
"shortcut_set_known_and_skip": "K",
"shortcut_settings": "Ctrl+O",
"shortcut_view_morphemes": "Ctrl+Alt+V",
"skip_only_known_morphs_cards": true,
"skip_show_num_of_skipped_cards": true,
"skip_unknown_morph_seen_today_cards": true,
"tag_known_automatically": "am-known-automatically",
"tag_known_manually": "am-known-manually",
"tag_learn_card_now": "am-learn-card-now",
"tag_not_ready": "am-not-ready",
"tag_ready": "am-ready"
}
My system
- Operating System: Windows 10
- Anki Version: 23.12.1
- AnkiMorphs Version: 2.2.5
Interesting.
It seems it gets broken by this regex:
anki-morphs/ankimorphs/morphemizer.py, lines 130 to 133 at cf081b8
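The splitting step there behaves roughly like this (an approximation based on the \w pattern discussed below, not a verbatim quote of cf081b8):

import re

# Approximate reconstruction of the old word splitting: Python's \w does not
# match Arabic combining marks (Unicode category Mn), so every diacritic acts
# as a word boundary.
expression = "أنا بصدّقك وإحنا دايماً منقلكم"
word_list = [word.lower() for word in re.findall(r"\w+", expression)]
# word_list -> ['أنا', 'بصد', 'قك', 'وإحنا', 'دايما', 'منقلكم']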
it produces this:
expression: أنا بصدّقك وإحنا دايماً منقلكم
word: أنا
word: بصد
word: قك
word: وإحنا
word: دايما
word: منقلكم
but if the regex is just replaced with:
word_list = [word.lower() for word in expression.split()]
then it looks like it works (to me at least)
expression: أنا بصدّقك وإحنا دايماً منقلكم
word: أنا
word: بصدّقك
word: وإحنا
word: دايماً
word: منقلكم
does that look right to you?
Yes, that looks like it is working as expected.
Could the regex be modified to extend the \w (word characters) with \u0610-\u061A\u064B-\u065F (Arabic diacritical marks)?
> Could the regex be modified to extend the \w (word characters) with \u0610-\u061A\u064B-\u065F (Arabic diacritical marks)?
Absolutely, thanks!
I'll include this in the v3 update, which will probably be released in ~2 weeks.
Thanks to all. This addon is amazing and I'm very grateful for your generosity in creating this.
Great! 🎉
I think this will solve a similar problem with Tamil, which I didn't have the time to look into. If the problem persists after v3 is released, I will open a new issue.
Sorry about the delay. Anyway, inserting the Unicode ranges into the regex pattern almost works, but not quite. Instead of trying to cram all the exceptions into a single regex pattern and making the entire morphemizer more fragile, I'll just add a new morphemizer that only splits on spaces, i.e.:
word_list = [word.lower() for word in expression.split()]
The new morphemizer will be called "Simple Space Splitter" and the old one will be called "SSS + Punctuations"
v3 is now live: https://github.com/mortii/anki-morphs/releases/tag/v3.0.0
Thanks for the feedback!
Version 3.0.0 indeed solves this problem with the Arabic script (tested with Persian and South Levantine Arabic) and the Tamil script 🥳
It introduces another problem, though: with AnkiMorphs: Simple Space Splitter, punctuation becomes an integral part of the morph. For example, "است.", "است،", "است؟", "است!" and "است" in Persian are all considered distinct morphs, while it would be best if they were unified as the same morph without the punctuation ("است"). Can't we just remove a given set of characters when processing the field, using a regex or some other method?

Unicode's punctuation categories are far too large for practical use with AnkiMorphs; the 'Punctuation, Other' category alone has no fewer than 628 characters, the vast majority of which are unlikely to ever be used with AnkiMorphs. I suggest adding a configurable list of punctuation characters (as a textbox/string), with reasonable defaults (covering Latin, Greek, Armenian, Hebrew, Arabic and Devanagari punctuation, for example, but excluding Mandaic). This way the list won't be too complex and will still accommodate most users; and since the list is configurable, if someone does learn Mandaic with AnkiMorphs, they can add the relevant characters easily. A sketch of this idea follows.
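A minimal sketch of that suggestion (hypothetical, not AnkiMorphs code; DEFAULT_PUNCTUATION stands in for the configurable list):

# Hypothetical sketch of a configurable punctuation list for the
# Simple Space Splitter: split on whitespace only, then trim the configured
# characters from both ends of every token.
DEFAULT_PUNCTUATION = ".,;:!?()[]{}«»،؛؟"  # assumed defaults, user-editable

def simple_space_split(expression: str, punctuation: str = DEFAULT_PUNCTUATION) -> list[str]:
    words = (word.lower().strip(punctuation) for word in expression.split())
    return [word for word in words if word]

print(simple_space_split("است. است، است؟ است! است"))
# -> ['است', 'است', 'است', 'است', 'است'] (all unified as the same morph)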
As a workaround, one can add a dedicated field, copy the target-language field into it (using Change Note Type, Ctrl+Shift+M) for all notes, and then replace all punctuation characters with nothing using a regular expression. AnkiMorphs can then be set to read this new field, which has no punctuation marks. This is cumbersome, but it should work.
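For illustration, a pattern along these lines (a hypothetical example; adjust the character set to your texts), replaced with nothing via the regular-expression option of Anki's Find and Replace dialog, would remove common punctuation:

[.,;:!?()\[\]{}«»،؛؟]+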
> Can't we just remove a given set of characters when processing the field, using a regex or some other method?
Good idea. I created a discussion thread for this: #273
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.