Hunspell for Kurdish

A morphological analyzer and spell checker for Kurdish in Hunspell (Sorani and Kurmanji🆕)


Latest update on April 28th, 2022

  • Morphosyntactic tags, i.e. po
  • Inflectional tags, i.e. is
  • Stems, i.e. st (verbal stems added in version 0.1.2; all part-of-speech tags covered from version 0.1.3)
  • Lemmas, i.e. lem
  • Creating the plugin for Microsoft Office and LibreOffice (check out extensions folder!)
  • ✨ Hunspell is now available for Kurmanji Kurdish as well! 🎉🥳
  • Derivational tags, i.e. ds

Hunspell is a spell checker and morphological analyzer originally designed for languages with rich morphology and complex word compounding. It is open-source software, widely used by web browsers and text editors. This repository contains an implementation of Kurdish morphological rules and an annotated lexicon for spell checking and morphological analysis. To use these functionalities, see the Kurdish Language Processing Toolkit (KLPT). Moreover, this spell checker is currently being added as an extension to LibreOffice and OpenOffice and can therefore be used within many text editors and browsers as well.

The project was initially created for Sorani Kurdish in early 2020. As of April 2022, a similar implementation is also provided for Kurmanji. The current project is the outcome of months of volunteer research and implementation. Please respect the terms of the license below and acknowledge the hours spent on dictionary tagging and on the extraction of morphological rules. See below to find out how you can sponsor this project.

Morphological rules

Kurdish morphology, particularly that of the Sorani dialect, is notoriously complex. This is due not only to the number of affixes and clitics, but also to the way they appear and interact within a word form. The following is a Sorani example of such complexity in a single-word verb, where the base girt of the verb girtin 'to take, to get' appears with clitics, suffixes and a verbal particle. The placement of the endoclitic =îş (in green boxes) and the agent marker =im (in blue boxes) varies with respect to the base and to each other in the verb form.

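[Figure: Sorani verb forms built on the base girt, with the endoclitic =îş in green boxes and the agent marker =im in blue boxes]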

In order to extract morphological rules, the morphology of Kurdish is studied formally in the paper A Formal Description of Sorani Kurdish Morphology. This formalization allows various morphosyntactic features of Kurdish to be represented as rules, which are provided in the ckb-Arab.aff and kmr-Latn.aff files.

  • Regarding the Sorani implementation, version 0.1.0 implements inflectional and derivational rules for verbs, adjectives, adverbs and nouns. Version 0.1.2 added verb stems. This is useful for the stemming task where, given a word form, its stem can be retrieved; for example, the word form 'ڕنیبووم' yields the stem 'ڕن'. Following this, version 0.1.3 added the stems of the other parts of speech and the lemma forms of verbs; for example, the word form 'دەنوێنم' yields the lemma 'نواندن'. Therefore, both the stemming and the lemmatization tasks are now fully operational. In addition, more lexical entries were added, particularly proper names.
  • Regarding the Kurmanji implementation, version 0.1.0 establishes the structure of the project: morphological rules are defined and a dictionary containing over 16,000 entries is manually tagged. Kurmanji morphology is simpler than that of Sorani. That said, to keep the usage of flags consistent across the project, the same flags are used in both dialects; for instance, the I (intransitive past stem) and T (transitive past stem) flags are treated equally even though ergativity in Kurmanji is dealt with differently from Sorani. Stems and lemmas are also available for Kurmanji.
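
To give a rough idea of how a Hunspell-style suffix rule turns a tagged stem into a word form, here is a minimal Python sketch. The `SuffixRule` class, the `apply_rule` helper and the rule itself (past stem plus -in gives the infinitive, as in rengand and rengandin) are hypothetical illustrations; they are not copied from ckb-Arab.aff or kmr-Latn.aff, and real Hunspell rules support more features than shown here.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SuffixRule:
    """A simplified Hunspell-style suffix rule (hypothetical, for illustration)."""
    flag: str       # flag a lexicon entry must carry for the rule to apply
    strip: str      # characters removed from the end of the stem ("" for none)
    add: str        # characters appended after stripping
    condition: str  # regular expression that the end of the stem must match


def apply_rule(stem: str, flags: str, rule: SuffixRule) -> Optional[str]:
    """Return the derived word form, or None if the rule does not apply."""
    if rule.flag not in flags:
        return None
    if not re.search(rule.condition + "$", stem):
        return None
    if rule.strip and not stem.endswith(rule.strip):
        return None
    base = stem[: len(stem) - len(rule.strip)] if rule.strip else stem
    return base + rule.add


# Hypothetical rule: past stem of a transitive verb (flag T) + "in" -> infinitive.
INFINITIVE = SuffixRule(flag="T", strip="", add="in", condition=".")

print(apply_rule("rengand", "T", INFINITIVE))  # rengandin
print(apply_rule("reng", "M", INFINITIVE))     # None (entry lacks the T flag)
```

The flag check is what ties the rules to the annotated lexicon: a rule only fires for entries tagged with its flag.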

Future versions will focus on further enriching the current categories and on correcting possible errors (please report them).

Lexicon annotation

As a rule-based method, Hunspell needs an annotated lexicon to which the morphological rules are applied. To this end, we use the lexicographic material provided by the FreeDict project and Wîkîferheng, the Kurdish Wiktionary. In addition, Wikidata is consulted to extract proper names. The transliteration of the Latin-based script of Kurdish into the Arabic-based one is carried out using Wergor. Each lemma in the lexicon is manually tagged with its part of speech, its formation type (derivational/inflectional) and further morphological properties. In addition, the component parts of compound forms are marked using a hyphen. This way, the annotated lexicon is also used within the Kurdish Tokenization project.

According to the morphological rules, lemmata in our lexicons are tagged using the following flags. If the flags don't make much sense to you, the part-of-speech tags, i.e. the po tags, will hopefully help, as they follow the Universal Dependencies tags. The annotated lexicons are available in ckb-Arab.dic and kmr-Latn.dic.

  • N: Noun
  • M: Masculine noun
  • F: Feminine noun
  • V: present stem of verbs
  • I: past stem of intransitive verbs
  • T: past stem of transitive verbs
  • A: adjectives
  • R: adverbs
  • E: numerals
  • C: conjunction
  • D: interjection
  • B: pronouns
  • P: adpositions (currently F in Sorani data)
  • G: particle
  • X: infinitive
  • Z: proper names
  • W: irregular cases like were 'come.imp.2s'
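
As a small illustration of how these flags can be read off a lexicon entry, the following Python sketch maps a flag string to the descriptions listed above. The `FLAG_DESCRIPTIONS` table and the `describe_flags` helper are hypothetical names introduced here, and the sketch assumes that every flag is a single character, which matches entries such as XN or AN shown in this README.

```python
from typing import List

# Flag descriptions taken from the list above. P is listed for adpositions,
# although the list notes that the Sorani data currently uses F for them.
FLAG_DESCRIPTIONS = {
    "N": "noun",
    "M": "masculine noun",
    "F": "feminine noun",
    "V": "present stem of verbs",
    "I": "past stem of intransitive verbs",
    "T": "past stem of transitive verbs",
    "A": "adjective",
    "R": "adverb",
    "E": "numeral",
    "C": "conjunction",
    "D": "interjection",
    "B": "pronoun",
    "P": "adposition",
    "G": "particle",
    "X": "infinitive",
    "Z": "proper name",
    "W": "irregular case",
}


def describe_flags(flags: str) -> List[str]:
    """Describe each single-character flag; unknown flags are reported as such."""
    return [FLAG_DESCRIPTIONS.get(f, f"unknown flag {f!r}") for f in flags]


print(describe_flags("XN"))  # ['infinitive', 'noun']
print(describe_flags("AN"))  # ['adjective', 'noun']
```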

The following is an example of how a few lemmata are tagged in the Sorani lexicon:

فەوتێنرا/I po:verb is:past_stem_intransitive_passive
فەوتێنران/XN po:verb is:infinitive_intransitive_passive
فەوتێنرێ/V po:verb is:present_stem_intransitive_passive
فەودە/ZN po:propn
فەڕ/N po:noun
فەڕاشە/N po:noun

and in the Kurmanji lexicon:

reng/M po:noun_masc
rengand/T po:verb is:past_stem_transitive_active st:reng lem:rengandin
rengandin/XN po:verb is:infinitive_transitive_active st:reng lem:rengandin
rengarengkirî/AN po:adj
rengdarbûyî/AN po:adj
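
The entries above follow a regular layout: a word, a slash, the flags, and then space-separated key:value tags. A minimal Python sketch of reading one such line is given below. The `LexiconEntry` type and the `parse_entry` function are hypothetical helpers, not part of this repository or of KLPT, and the sketch assumes that words contain no spaces and that each key occurs at most once per line.

```python
from typing import Dict, NamedTuple


class LexiconEntry(NamedTuple):
    word: str               # the lemma or stem as written in the lexicon
    flags: str              # flags after the slash, e.g. "T" or "XN"
    fields: Dict[str, str]  # tags such as po, is, st and lem


def parse_entry(line: str) -> LexiconEntry:
    """Split one lexicon line into its word, flags and key:value tags."""
    head, *rest = line.split()
    word, _, flags = head.partition("/")
    fields = dict(item.split(":", 1) for item in rest)
    return LexiconEntry(word, flags, fields)


entry = parse_entry(
    "rengand/T po:verb is:past_stem_transitive_active st:reng lem:rengandin"
)
print(entry.word, entry.flags)                  # rengand T
print(entry.fields["st"], entry.fields["lem"])  # reng rengandin
```

With the entry parsed this way, the st and lem tags give the stem and the lemma directly, which is the information the stemming and lemmatization tasks rely on.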

Cite this paper

There are two publications regarding this project, which should be cited as follows:

@article{ahmadi2020Hunspell,
	title={{Hunspell for Sorani Kurdish Spell Checking and Morphological Analysis}},
	author={Ahmadi, Sina},
	journal={arXiv preprint arXiv:2109.06374},
	year={2021},
}

@article{ahmadi2020formalization,
	title={{A Formal Description of Sorani Kurdish Morphology}},
	author={Ahmadi, Sina},
	journal={arXiv preprint arXiv:2109.03942},
	year={2021}
}

Contribute

Are you interested in this project? Please follow the instructions of the Kurdish Language Processing Toolkit (KLPT) to get involved. Open-source is fun! 😊

Sponsorship

The current project is the fruit of hundreds of hours of research and development. If this project matters to you, please support me through the Sponsor button at the top of the page. Thanks!

My warmest thanks to datavaluepeople and Build Up for their sponsorship that gave me the motivation to complete adding stems. Version 0.1.3 was made possible thanks to them! ❤️

License

This repository by Sina Ahmadi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, which means:

  • You are free to share, copy and redistribute the material in any medium or format and also adapt, remix, transform, and build upon the material for any purpose, even commercially.
  • You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.