(last updated on 12/06/2021)
A curated list of awesome resources, tools and scientific papers for Kurdish language technology
Although I do my best to keep this page as comprehensive as possible, the list may still miss some of the many small and large projects on Kurdish language processing. If you know of one that is missing, please let me know by email or through our community on Gitter.
Are you interested in contributing to Kurdish language processing? Check out this post to see how you can do so.
- Open Super-large Crawled ALMAnaCH coRpus (OSCAR) (Sorani and Kurmanji)
- Pewan (Sorani and Kurmanji)
- Kurdish folkloric lyrics corpus (Sorani)
- AsoSoft corpus (Sorani)
- Kurdish Textbooks Corpus (Sorani)
- Zaza-Gorani corpus (Zazaki and Gorani)
- Kurdish resources on Clarin
- Ataman's Bianet corpus containing Turkish-English-Kurmanji aligned texts
- Ahmadi et al.'s corpus containing English-Kurmanji-Sorani aligned texts
- Tanzil: one Quran translation alignable with many translations in other languages, including 11 in English (see this project)
- Bible translations in Kurmanji-Latin and Kurmanji-Cyrillic
- TED Talks subtitles
- HLP Colloquial Corpus #1 (Sorani and Kurmanji (Latin and Arabic)) (not free)
- A parallel corpus of Sorani-English text
- FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation (Sorani)
Check out a comprehensive list of Kurdish dictionaries and beware of copyright issues in the following projects:
- Kurdish lexicographical resources in Ontolex-Lemon (Sorani, Kurmanji, Gorani and Southern Kurdish)
- Check Dolan Hêriş's repositories for a list of Kurdish dictionaries and tools to extract words
- KurdNet-the Kurdish wordNet (Sorani)
- Kurdish annotated lexicon (Sorani)
- Freedict word lists (Sorani and Kurmanji)
- Translation Initiative for COVID-19 including Sorani and Kurmanji
- MyMemory dictionaries with an open-access API (Sorani)
- Dataset of Kurdish poems with meter and form tags
- A Twitter dataset (Sorani and Kurmanji)
- Datasets for text to Kurdish Sign Language (Sorani)
- A dataset for speech recognition (Sorani)
- A sentence-segmented dataset (Sorani)
- Evaluation datasets for Kurdish Grapheme-to-Phoneme Conversion systems (Sorani)
- Universal Dependencies (Kurmanji)
- Wergor transliteration datasets
- Web Inventory of Transcribed and Translated Talks (WIT3) (Sorani)
- Sorani and Kurmanji morphological datasets in UniMorph
- FakeKurdNews, an annotated dataset for Sorani Kurdish fake news detection
- fastText word vectors (Sorani and Kurmanji)
- Polyglot's word embeddings
- Language identifier (Sorani and Kurmanji)
- Wergor for transliteration (Sorani and Kurmanji)
- Kurdish Tokenization
- Jedar stemmer
- Apertium project for Kurmanji and Sorani morphological analysis
- Kurdish Hunspell for Sorani morphological analysis, spell checking, stemming and lemmatization
- Part-of-speech tagger (Sorani)
- Alexina Framework: morphological analysis and POS-tagger for Sorani (soralex) and Kurmanji (kurlex)
- Kurdspell for Sorani spell checking
- Apertium rule-based Sorani spell-checker
- Apertium (Sorani and Kurmanji)
- Kurdish MT (Sorani)
- Autoregressive Entity Retrieval (Kurmanji)
- Kurdish Language Processing Toolkit: a natural language processing toolkit in Python
- Kurdînûs: pure JavaScript tools for transliteration, text conversion and normalization
- Kurdish Language Library: converting characters and digits in Persian, English and Arabic to Kurdish and vice versa
- AsoSoft's Library for Kurdish: normalizer, numeral converter, grapheme-to-phoneme converter in C#
In addition to these, you can find further information in other repositories and pages as follows:
These references are based on the data collected in the paper entitled KLPT – Kurdish Language Processing Toolkit. Note that the references are provided in the bibliography file.
Reference | Year | Field | Dialects |
---|---|---|---|
esmaili2013sorani | 2013 | Dialectology | Sorani, Kurmanji |
hassani2016automatic | 2016 | Dialectology | Sorani, Kurmanji |
malmasi2016subdialectal | 2016 | Dialectology | Sorani |
al2017kurdish | 2017 | Dialectology | Sorani, Kurmanji, Gorani |
amani:hal-03262435 | 2021 | Dialectology | Kurdish, Zazaki & Gorani |
mohammed2012automatic | 2012 | Information retrieval and Text mining | Sorani |
esmaili2012challenges | 2012 | Information retrieval and Text mining | Sorani |
littell2016named | 2016 | Information retrieval and Text mining | Sorani |
hassani2017method | 2017 | Information retrieval and Text mining | Sorani, Kurmanji |
esmaAl-Talabaniili2014towards | 2014 | Information retrieval and Text mining | Sorani, Kurmanji |
jaf2016simple | 2016 | Information retrieval and Text mining | Sorani |
rashid2017robust | 2017 | Information retrieval and Text mining | Sorani |
rashid2017automatic | 2017 | Information retrieval and Text mining | Sorani |
saeed2018improving | 2018 | Information retrieval and Text mining | Sorani |
mustafa2018kurdish | 2018 | Information retrieval and Text mining | Sorani |
saeed2018evaluation | 2018 | Information retrieval and Text mining | Sorani |
ahmadi2019wergor | 2019 | Information retrieval and Text mining | Sorani |
mahmudi2021automated | 2021 | Information retrieval and Text mining | Sorani |
esmaili2013building | 2013 | Lexical resources | Sorani |
aliabadi2014towards | 2014 | Lexical resources | Sorani |
aliabadi2014semi | 2014 | Lexical resources | Sorani |
ataman2018bianet | 2018 | Lexical resources | Kurmanji |
ahmadi2019towards | 2019 | Lexical resources | Sorani, Kurmanji, Gorani |
abdulrahman2019developing | 2019 | Lexical resources | Sorani |
abdulrahman2020using | 2020 | Lexical resources | Sorani |
veisi2020toward | 2020 | Lexical resources | Sorani |
ahmadi2020corpus | 2020 | Lexical resources | Sorani |
ahmadi-2020-building | 2020 | Lexical resources | Zaza, Gorani |
ahmadi2020leveraging | 2020 | Lexical resources | Sorani |
veisi2021jira | 2021 | Lexical resources | Sorani |
hassani2017kurdish | 2017 | Machine Translation | Sorani, Kurmanji |
kaka2018english | 2018 | Machine Translation | Sorani |
ahmadi2020machine | 2020 | Machine Translation | Sorani |
goyal2021flores | 2021 | Machine Translation | 101 languages incl. Sorani |
amini2021central | 2021 | Machine Translation | Sorani |
baban1995programmable | 1995 | Morphological and syntactic analysis | Sorani |
walther2010developing | 2010 | Morphological and syntactic analysis | Sorani |
walther2010fast | 2010 | Morphological and syntactic analysis | Kurmanji |
salavati2013stemming | 2013 | Morphological and syntactic analysis | Sorani |
jaf2014stemmer | 2014 | Morphological and syntactic analysis | Sorani |
jaf2016chapter | 2016 | Morphological and syntactic analysis | Sorani |
gokirmak2017dependency | 2017 | Morphological and syntactic analysis | Kurmanji |
salavati2018building | 2018 | Morphological and syntactic analysis | Sorani |
mustafa2018kurdish | 2018 | Morphological and syntactic analysis | Sorani |
ahmadi2020towards | 2020 | Morphological and syntactic analysis | Sorani |
ahmadi-2020-tokenization | 2020 | Morphological and syntactic analysis | Sorani, Kurmanji |
mohammed2012uniqueness | 2012 | Optical character recognition | Sorani |
mohammed2013handwritten | 2013 | Optical character recognition | Sorani |
shaltookisentiment | 2016 | Optical character recognition | Sorani |
zarro2017recognition | 2017 | Optical character recognition | Sorani |
yaseen2018kurdish | 2018 | Optical character recognition | Sorani |
dinler2018kurdish | 2018 | Optical character recognition | Sorani |
kaka2017building | 2017 | Other | Sorani |
mahmudi2021automatic | 2021 | Other | Sorani |
hashim2018kurdish | 2018 | Sign language recognition | Sorani |
kamal-hassani-2020-towards | 2020 | Sign language recognition | Sorani |
daneshfar2009implementation | 2009 | Speech recognition | Sorani |
barkhoda2009comparison | 2009 | Speech recognition | Sorani |
bahrampour2009implementation | 2009 | Speech recognition | Sorani |
hassani2011kurdish | 2011 | Speech recognition | Sorani |
dinler2017formant | 2017 | Speech recognition | Kurmanji |
dinler2018extraction | 2018 | Speech recognition | Sorani, Kurmanji |
qader2019kurdish | 2019 | Speech recognition | Sorani |
ahmadi-2020-klpt | 2020 | Toolkits | Sorani, Kurmanji |
de2021multilingual | 2021 | Named-entity recognition | Kurmanji |
If you find the provided data useful for your project, feel free to use it, and please cite the following paper:
@inproceedings{ahmadi-2020-klpt,
title = "{KLPT} {--} {K}urdish Language Processing Toolkit",
author = "Ahmadi, Sina",
booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.nlposs-1.11",
doi = "10.18653/v1/2020.nlposs-1.11",
pages = "72--84"
}