Awesome Azeri NLP

A curated list of awesome Azerbaijani language processing software, models and datasets. Inspired by awesome-ML.

The main focus is on open source tools, downloadable data and research papers with code.

If you want to contribute to this list (please do), send me a pull request. A listed repository should be tagged as deprecated if either of the following holds:

The repository's owners explicitly say that "this library is not maintained".
It has had no commits for a long time (2-3 years).

Datasets

Raw text

University of Leipzig corpus collection — Newscrawl (2011, 2013) and Wikipedia (misc) datasets
Helsinki University corpus — New Testament in the Azerbaijani language
Latest azwiki dump: download directly
Azeri at An Crúbadán — 8M+ words, Latin script
az-corpus-nlp — 2000+ texts, Latin script
azWaC: Azerbaijani corpus from the web — SketchEngine-hosted corpus crawled from the web in 2012, ~94 million words
Domrachev-Sudoplatova scraped corpus — 2189398 words, 100560 sentences
Azerbaijani Named Entity Recognition (NER) Dataset — A dataset for training and evaluating NER models in Azerbaijani, including annotated text data with various named entities.
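
The azwiki dump listed above is distributed as a compressed MediaWiki XML export. Below is a minimal sketch of streaming page titles and raw wikitext out of such a file using only the Python standard library. The names `iter_pages` and `open_dump` are illustrative, not part of any listed tool, and the sketch assumes a bz2-compressed dump. It does not strip wiki markup.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        # Drop the finished page's children so memory use stays bounded.
        elem.clear()


def open_dump(path):
    """Open a bz2-compressed dump for binary reading (assumed file format)."""
    return bz2.open(path, "rb")
```

With a downloaded dump, `for title, text in iter_pages(open_dump(path)): ...` would walk the pages one by one without loading the whole file.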

Several corpora are also mentioned in research works:

S. Mammadova, G. Azimova, and A. Fatullayev. 2010. Text corpora and its role in development of the linguistic technologies for the Azerbaijani language. In The Third International Conference Problems of Cybernetics and Informatics.
Baisa, Vít, and Vít Suchomel. "Large corpora for Turkic languages and unsupervised morphological analysis." Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA). 2012. [SketchEngine corpora?]
C. Biemann, S. Bordag, G. Heyer, U. Quasthoff, and C. Wolff. 2004. Language-independent methods for compiling monolingual lexical data. Computational linguistics and intelligent text processing, pages 217–228.
Domrachev M. A., Sudoplatova S. N. Testing Methods for Automatic Detection of Morpheme Boundaries in the Azerbaijani Language. Vestnik NSU. Series: Linguistics and Intercultural Communication, 2018, vol. 16, no. 2, pp. 34–47. (in Russ.) Downloadable corpus
Özenç B., Ehsani R., Solak E. Moraz: an open-source morphological analyzer for Azerbaijani Turkish. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 25–29. [BBC Azerbaijan]

Syntax

UD_Azerbaijani-TueCL: a treebank that contains a total of ~110 sentences including 20 Cairo sentences, and ~90 sentences suggested by UD Turkic Group; part of the UD Turkic Treebank. Translations of all the sentences are available in English, Turkish and Kyrgyz languages
UD project comments on difficulties in Turkish language processing, which may shed light on why parsing Azeri is hard as well
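
UD treebanks such as the one above are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. The following is a minimal reader sketch using only the standard library. The function name `read_conllu` is illustrative, and multiword-token ranges and empty nodes are simply skipped rather than handled.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(stream):
    """Yield sentences as lists of token dicts from a CoNLL-U text stream."""
    sentence = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        # Keep only plain word lines; IDs like "1-2" or "1.1" are skipped.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence
```

Each token is a plain dictionary, so `[t["upos"] for t in sentence]` gives the tag sequence and `t["head"]` gives the ID of the syntactic head as a string.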

Machine-readable dictionaries

TODO

Summarization

AZ summarization — articles and titles, available on request

Translation

AZ-EN parallel corpus — 68K+ sentences, available on request

Sentiment

Mentioned in:

N. Gasimli's MS thesis "Analysis of the use of Twitter in Azerbaijan" — 2194+700 tweets
Mammad Hajili's 160K customer reviews with scores and upvotes

Pretrained models

Polyglot morfessor — pretrained morfessor model, number 53
fastText — 300-dimensional fastText vectors provided by the authors
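
Pretrained fastText vectors are commonly shipped in a plain-text `.vec` format: a header line with the vocabulary size and the dimension, then one word per line followed by its vector components. Below is a minimal sketch of loading such a file and comparing two words, assuming that text format. The helper names `load_vec` and `cosine` are illustrative and not part of the fastText library.

```python
import math


def load_vec(stream, limit=None):
    """Read word vectors from a text-format .vec stream into a dict."""
    header = stream.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dim>"
    vectors = {}
    for line in stream:
        parts = line.rstrip("\n").split(" ")
        word = parts[0]
        values = [float(x) for x in parts[1:] if x]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = values
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

For the full 300-dimensional file, passing `limit` keeps only the first entries, which is usually enough for quick experiments.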

Methods/Software

Morphology

Azmorph: morphological analyzer for Azerbaijani (Azərbaycan dili), said to be in a pre-alpha state; however, it was used for web corpora preparation
Wiktionary word forms extraction — Python code on github
MorAz: open-source morphological analyzer, paper, demo, related slides on AZ morphology.

Mentioned in papers:

POS-tagging paper — Mammadov, S., Rustamov, S., Mustafali, A., Sadigov, Z., Mollayev, R., & Mammadov, Z. (2018, October). Part-of-Speech Tagging for Azerbaijani Language. In 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT) (pp. 1-6). IEEE. [Probable implementation: aznlp repo]
Stemming paper, 2019 — Alizadeh, M. B. H., & Seyyedi, S. A. H. (2019). AUTO STEMMING OF AZERBAIJANI LANGUAGE. Problems of Information Technology, 59-66.
N. Gasimli's MS thesis "Analysis of the use of Twitter in Azerbaijan": Zemberek is extended for Azerbaijani, though the author states that a lot of effort is still required for it to work properly for the Azeri language.

Syntax

TODO

Online Demos

Cyrillic ⇄ Latin conversion — PHP-based online tool
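
Because the 32 letters of the Azerbaijani Cyrillic and Latin alphabets correspond one to one, a basic converter can be sketched as a character-level translation table. The code below is an illustration of that idea only and is not the implementation behind the online tool above. The function names are hypothetical, and characters outside the two alphabets (digits, punctuation, apostrophes) are passed through unchanged.

```python
# Letters are listed in matching order; upper case is spelled out
# explicitly because str.upper() does not handle dotted/dotless i correctly.
CYR_LOWER = "абвгғдеәжзиыјкҝлмноөпрстуүфхһчҹш"
CYR_UPPER = "АБВГҒДЕӘЖЗИЫЈКҜЛМНОӨПРСТУҮФХҺЧҸШ"
LAT_LOWER = "abvqğdeəjziıykglmnoöprstuüfxhçcş"
LAT_UPPER = "ABVQĞDEƏJZİIYKGLMNOÖPRSTUÜFXHÇCŞ"

TO_LATIN = str.maketrans(CYR_LOWER + CYR_UPPER, LAT_LOWER + LAT_UPPER)
TO_CYRILLIC = str.maketrans(LAT_LOWER + LAT_UPPER, CYR_LOWER + CYR_UPPER)


def cyr_to_lat(text):
    """Convert Azerbaijani Cyrillic text to the Latin alphabet."""
    return text.translate(TO_LATIN)


def lat_to_cyr(text):
    """Convert Azerbaijani Latin text to the Cyrillic alphabet."""
    return text.translate(TO_CYRILLIC)
```

For example, `cyr_to_lat("Азәрбајҹан дили")` gives `"Azərbaycan dili"`, and `lat_to_cyr` maps it back.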

Miscellaneous

Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University
Azerbaijani corpora data review
Dilmanc — government-funded Azerbaijani language-related initiative
Dilmanc EAMT paper on MT peculiarities
Apertium page — a list of various online language-related resources
AZNLP github — a repo hub with various language-related software: stemmer, POS-tagger
MozillaAZ community spellchecker — spellchecker plugin

alexeyev/awesome-azeri-nlp
