/awesome-azeri-nlp

Azerbaijani language processing software, models and datasets.

Awesome Azeri NLP

A curated list of awesome Azerbaijani language processing software, models and datasets. Inspired by awesome-ML.

The main focus is on open source tools, downloadable data and research papers with code.

If you want to contribute to this list (please do), send me a pull request. Also, a listed repository should be tagged as deprecated if:

  • The repository's owners explicitly state that "this library is not maintained".
  • The repository has not received a commit for a long time (2~3 years).
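
The two criteria above can be checked mechanically. The following is a minimal sketch, not part of this list's tooling: the function name `is_deprecated`, the 2-year threshold (the lower bound of "2~3 years") and the phrase list are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumption: "2~3 years" is read as a 2-year threshold; adjust as needed.
STALE_AFTER = timedelta(days=2 * 365)

# Assumption: illustrative phrases an owner might use to announce abandonment.
UNMAINTAINED_PHRASES = ("not maintained", "no longer maintained", "unmaintained")


def is_deprecated(last_commit: date, readme_text: str, today: date) -> bool:
    """Return True if either deprecation criterion from the list applies."""
    text = readme_text.lower()
    # Criterion 1: the owners explicitly say the project is not maintained.
    explicitly_unmaintained = any(p in text for p in UNMAINTAINED_PHRASES)
    # Criterion 2: no commit for longer than the threshold.
    stale = (today - last_commit) > STALE_AFTER
    return explicitly_unmaintained or stale
```

The last commit date and README text have to be supplied by the caller, for example from a local clone; the sketch deliberately makes no network calls.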

Table of Contents

Datasets

Raw text

Several corpora are also mentioned in research works:

  • S. Mammadova, G. Azimova, and A. Fatullayev. 2010. Text corpora and its role in development of the linguistic technologies for the Azerbaijani language. In The Third International Conference Problems of Cybernetics and Informatics.
  • Baisa, Vít, and Vít Suchomel. "Large corpora for Turkic languages and unsupervised morphological analysis." Proceedings of the Eighth conference on International Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA). 2012. [SketchEngine corpora?]
  • C. Biemann, S. Bordag, G. Heyer, U. Quasthoff, and C. Wolff. 2004. Language-independent methods for compiling monolingual lexical data. Computational linguistics and intelligent text processing, pages 217–228.
  • Domrachev M. A., Sudoplatova S. N. Testing Methods for Automatic Detection of Morpheme Boundaries in the Azerbaijani Language. Vestnik NSU. Series: Linguistics and Intercultural Communication, 2018, vol. 16, no. 2, p. 34–47. (in Russ.) [Downloadable corpus]
  • Özenç B., Ehsani R., Solak E. Moraz: an open-source morphological analyzer for Azerbaijani Turkish // Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. – 2018. – pp. 25-29. [BBC Azerbaijan]

Syntax

  • UD_Azerbaijani-TueCL: a treebank of ~110 sentences in total, comprising 20 Cairo sentences and ~90 sentences suggested by the UD Turkic Group; part of the UD Turkic Treebank. Translations of all sentences are available in English, Turkish and Kyrgyz.
  • The UD project's comments on the difficulties of Turkish language processing may shed light on why parsing Azeri is hard as well.
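
UD treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. The following is a minimal reading sketch that assumes only that format; `read_conllu` is a hypothetical helper, and the sample is a toy English sentence written for illustration, not data from the treebank.

```python
def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip malformed lines
        tok_id = cols[0]
        if "-" in tok_id or "." in tok_id:
            continue  # skip multiword token ranges and empty nodes
        tokens.append({
            "id": int(tok_id),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if tokens:
        sentences.append(tokens)
    return sentences


# Toy sample for illustration only (not taken from UD_Azerbaijani-TueCL).
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

sentences = read_conllu(SAMPLE)
print(len(sentences), [t["form"] for t in sentences[0]])
```

To read a downloaded treebank file, pass its contents (opened with UTF-8 encoding) to the same function.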

Machine-readable dictionaries

TODO

Summarization

Translation

Sentiment

Mentioned in:

Pretrained models

Methods/Software

Morphology

Mentioned in papers:

  • POS-tagging paper — Mammadov, S., Rustamov, S., Mustafali, A., Sadigov, Z., Mollayev, R., & Mammadov, Z. (2018, October). Part-of-Speech Tagging for Azerbaijani Language. In 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT) (pp. 1-6). IEEE. [Probable implementation: aznlp repo]
  • Stemming paper, 2019 — Alizadeh, M. B. H., & Seyyedi, S. A. H. (2019). AUTO STEMMING OF AZERBAIJANI LANGUAGE. Problems of Information Technology, 59-66.
  • N. Gasimli's MS thesis "Analysis of the use of Twitter in Azerbaijan": Zemberek is extended for Azerbaijani, though the author states that considerable effort is still required for it to work properly for Azeri.

Syntax

  • TODO

Online Demos

Miscellaneous