A curated list of awesome Azerbaijani language processing software, models and datasets. Inspired by awesome-ML.
The main focus is on open source tools, downloadable data and research papers with code.
If you want to contribute to this list (please do), send me a pull request. Also, a listed repository should be tagged as deprecated if:
- The repository's owners explicitly state that "this library is not maintained".
- The repository has had no commits for a long time (2-3 years).
- University of Leipzig corpus collection — Newscrawl (2011, 2013) and Wikipedia (misc) datasets
- Helsinki University corpus — New Testament in the Azerbaijani language
- Latest azwiki dump: download directly
- Azeri at An Crúbadán — 8M+ words, Latin script
- az-corpus-nlp — 2000+ texts, Latin script
- azWaC: Azerbaijani corpus from the web — SketchEngine-hosted corpus crawled from the web in 2012, ~94 million words
- Domrachev-Sudoplatova scraped corpus — 2,189,398 words, 100,560 sentences
- Azerbaijani Named Entity Recognition (NER) Dataset — A dataset for training and evaluating NER models in Azerbaijani, including annotated text data with various named entities.
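When counting words in any of the Latin-script corpora above, case folding needs care: Azerbaijani, like Turkish, distinguishes dotted İ/i from dotless I/ı, while Python's built-in `str.lower()` maps `I` to `i`. The following is a minimal sketch of Azerbaijani-aware lowercasing and word counting. The names `az_lower`, `tokenize` and `word_counts` are illustrative only and do not come from any project in this list.

```python
import re
from collections import Counter

# Runs of Unicode letters; digits, underscores and punctuation split words.
WORD_RE = re.compile(r"[^\W\d_]+")


def az_lower(text: str) -> str:
    """Lowercase text using the Azerbaijani dotted/dotless i rules."""
    # Handle the two upper-case i letters first, then defer to str.lower().
    text = text.replace("\u0130", "i")   # İ (dotted capital)  -> i
    text = text.replace("I", "\u0131")   # I (dotless capital) -> ı
    return text.lower()


def tokenize(text: str) -> list:
    """Return lowercased word tokens from a piece of text."""
    return WORD_RE.findall(az_lower(text))


def word_counts(text: str) -> Counter:
    """Count word frequencies after Azerbaijani-aware lowercasing."""
    return Counter(tokenize(text))


if __name__ == "__main__":
    print(az_lower("I\u015eIQ"))              # ışıq
    print(word_counts("Kitab kitab K\u0130TAB"))
```

This treats every run of letters as a word, which is a simplification: hyphenated forms and apostrophes are split, so counts may differ slightly from the figures reported by the corpus authors.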
Several corpora are also mentioned in research works:
- S. Mammadova, G. Azimova, and A. Fatullayev. 2010. Text corpora and its role in development of the linguistic technologies for the Azerbaijani language. In The Third International Conference Problems of Cybernetics and Informatics.
- Baisa, Vít, and Vít Suchomel. "Large corpora for Turkic languages and unsupervised morphological analysis." Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA). 2012. [SketchEngine corpora?]
- C. Biemann, S. Bordag, G. Heyer, U. Quasthoff, and C. Wolff. 2004. Language-independent methods for compiling monolingual lexical data. Computational linguistics and intelligent text processing, pages 217–228.
- Domrachev M. A., Sudoplatova S. N. Testing Methods for Automatic Detection of Morpheme Boundaries in the Azerbaijani Language. Vestnik NSU. Series: Linguistics and Intercultural Communication, 2018, vol. 16, no. 2, pp. 34-47. (in Russ.) Downloadable corpus
- Özenç B., Ehsani R., Solak E. Moraz: an open-source morphological analyzer for Azerbaijani Turkish. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 25-29. [BBC Azerbaijan]
- UD_Azerbaijani-TueCL: a treebank that contains a total of ~110 sentences including 20 Cairo sentences, and ~90 sentences suggested by UD Turkic Group; part of the UD Turkic Treebank. Translations of all the sentences are available in English, Turkish and Kyrgyz languages
- UD project comments on the difficulties of Turkish language processing, which may shed light on why parsing Azeri is hard as well
TODO
- AZ summarization — articles and titles, available on request
- AZ-EN parallel corpus — 68K+ sentences, available on request
Mentioned in:
- N. Gasimli's MS thesis "Analysis of the use of Twitter in Azerbaijan" — 2194+700 tweets
- Mammad Hajili's 160K customer reviews with scores and upvotes
- Polyglot morfessor — pretrained morfessor model, number 53
- fastText — 300-dimensional fastText vectors provided by the authors
- Azmorph — morphological analyzer for Azerbaijani (Azerbaycan dili), said to be in a pre-ALPHA state; however, it was used for web corpora preparation
- Wiktionary word forms extraction — Python code on github
- MorAz — open-source morphological analyzer, with paper, demo and related slides on AZ morphology.
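Pretrained fastText vectors such as those listed above are commonly distributed as a plain-text `.vec` file: a header line with the vocabulary size and the dimension, then one word per line followed by its values. Below is a minimal sketch of reading that layout without extra dependencies. The file name `cc.az.300.vec` and the helper names `load_vec` and `cosine` are assumptions made for illustration, not taken from this list.

```python
import io
import math
from typing import Dict, List, Optional, TextIO


def load_vec(handle: TextIO, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Read word vectors from a text stream in the .vec layout."""
    # Header: "<vocabulary size> <dimension>"
    _count, dim = map(int, handle.readline().split())
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


if __name__ == "__main__":
    # Tiny in-memory sample with 3 dimensions instead of 300.
    sample = io.StringIO("2 3\nkitab 0.1 0.2 0.3 \nev 1.0 0.0 0.0\n")
    vecs = load_vec(sample)
    print(cosine(vecs["kitab"], vecs["ev"]))
    # For a real file (hypothetical name):
    # with open("cc.az.300.vec", encoding="utf-8") as fh:
    #     vecs = load_vec(fh, limit=50000)
```

Passing `limit` keeps memory use manageable, since the full vocabulary is large; a library such as gensim would be the more usual choice for serious work.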
Mentioned in papers:
- POS-tagging paper — Mammadov, S., Rustamov, S., Mustafali, A., Sadigov, Z., Mollayev, R., & Mammadov, Z. (2018, October). Part-of-Speech Tagging for Azerbaijani Language. In 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT) (pp. 1-6). IEEE. [Probable implementation: aznlp repo]
- Stemming paper, 2019 — Alizadeh, M. B. H., & Seyyedi, S. A. H. (2019). AUTO STEMMING OF AZERBAIJANI LANGUAGE. Problems of Information Technology, 59-66.
- N. Gasimli's MS thesis "Analysis of the use of Twitter in Azerbaijan" — Zemberek is extended for Azerbaijani, though the thesis states that a lot of effort is still required for it to work properly for the Azeri language.
- TODO
- Cyrillic ⇄ Latin conversion — PHP-based online tool
- Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University
- Azerbaijani corpora data review
- Dilmanc — government-funded Azerbaijani language-related initiative
- Dilmanc EAMT paper on MT peculiarities
- Apertium page — a list of various online language-related resources
- AZNLP github — a repo hub with various language-related software: stemmer, POS-tagger
- MozillaAZ community spellchecker — spellchecker plugin