Awesome Kyrgyz NLP

A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.

The main focus is on open source tools, downloadable data and research papers with code.

If you want to contribute to this list (please do), send me a pull request. Also, a listed repository should be tagged as deprecated if:

The repository's owners explicitly say that "this library is not maintained".
It has not received commits for a long time (2-3 years).

Datasets

Corpora

Manas-UdS: 1.2M words, 84 literary texts, 5 genres: novel, novelette, epic, minor epic, and fairy tale; lemmata, PoS tags, rich per-text metadata.
kkWaC: Kyrgyz corpus from the web, 19M words, Jan 2012
Kyrgyz in Leipzig Corpora Collection: Community data / Newscrawl (1M sentences) / Wikipedia (300K sentences)
TilCorpusu: Kyrgyz corpus, 100M words, news and fiction; only the news part was made public (July 2023) due to legal restrictions

Character recognition

Kyrgyz language hand-written letters (Kyrgyz MNIST): hand-written Kyrgyz alphabet letters collection for machine learning applications; original images (a total of 80213) have been transformed to 50x50 images, then to CSV format
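
A minimal sketch of turning one CSV row back into a 50x50 pixel grid, using only the standard library. The column layout is an assumption, not something stated by the dataset authors: the sketch assumes each row holds an optional label followed by 2500 pixel values in row-major order. The names `row_to_image` and `read_images` are hypothetical helpers.

```python
import csv
import io

SIDE = 50  # the dataset description says images were resized to 50x50


def row_to_image(row, has_label=True):
    """Convert one CSV row into (label, grid), where grid is 50 lists of 50 ints.

    ASSUMPTION: the label (if present) is the first column and the pixels
    follow in row-major order. Check the actual file before relying on this.
    """
    label = row[0] if has_label else None
    values = row[1:] if has_label else row
    if len(values) != SIDE * SIDE:
        raise ValueError(f"expected {SIDE * SIDE} pixel values, got {len(values)}")
    pixels = [int(float(v)) for v in values]
    grid = [pixels[r * SIDE:(r + 1) * SIDE] for r in range(SIDE)]
    return label, grid


def read_images(csv_text, has_label=True, skip_header=False):
    """Parse CSV text into a list of (label, grid) pairs, skipping empty rows."""
    reader = csv.reader(io.StringIO(csv_text))
    if skip_header:
        next(reader, None)
    return [row_to_image(row, has_label) for row in reader if row]
```

If the real file puts the label last or has a header row, only the slicing in `row_to_image` and the `skip_header` flag need to change.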

Raw text

kloop corpus: 16'826 articles (sqlite3 DB file) + crawler code
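
The table and column names of the sqlite3 file are not described in this list, so a reasonable first step is to inspect the schema before querying. The sketch below uses only Python's built-in `sqlite3` module; `describe_sqlite` is a hypothetical helper name.

```python
import sqlite3


def describe_sqlite(path):
    """Return {table_name: [column names]} for every table in a SQLite file.

    Note: sqlite3.connect creates an empty database if `path` does not exist,
    so an empty result may simply mean the path is wrong.
    """
    conn = sqlite3.connect(path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info returns one row per column; index 1 is the name.
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()
```

Calling `describe_sqlite("kloop.db")` (the file name here is a placeholder, not the actual name of the released file) shows which tables hold the articles and which columns to select.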

Morphology & Syntax

UD project comments on the difficulties of Turkish language processing, which may shed light on why parsing Kyrgyz is hard as well
KTMU's UD Treebank, 781 sentences; UPD: now even more sentences! + some fixes in the previous version of the dataset
Small UD Treebank: 145 sentences (incl. 20 Cairo sentences), and ~ 100 sentences suggested by UD Turkic Group; a part of UD Turkic Treebank; also note that the translations to English, Azerbaijani and Turkish are available
Verbal paradigms for Kyrgyz (100 Kyrgyz verbs fully conjugated in all tenses) by Aytnatova Alima, annotation for Unimorph by E. Chodroff

Named Entity Recognition

WikiANN has a Kyrgyz language part
KyrgyzNER: [not published yet]

Text Classification

Kyrgyz Multi-Label News Classification: [not published yet]

Word Similarity Data

Kyrgyz Word Embedding Evaluation: [not published yet]

Instructions

Machine-Translated Alpaca: Stanford Alpaca instructions translated into Kyrgyz using ChatGPT and Google Translate

Machine-readable dictionaries

Country names table: Kyrgyz-Russian-English
Thesaurus KyrSpell (however, unpacking it seems to break the license)
Tatu Ylonen's enwiktionary-based dictionary (also please see the derived En-Ky Anki deck for language learners)

Pretrained models

Polyglot morfessor — pretrained morfessor model, number 6
fastText — 300-dimensional fastText vectors provided by the authors: bin, txt.
compressed fastText — fasttext-ky-mini prepared by Liebl Bernhard in 2021.
BERT-based NER — bert-base-multilingual-cased fine-tuned on Wikiann for NER on Kyrgyz. The author warns that this model is not usable and is built just as a proof of concept. Will be updated later.
Manas-GPT — Janar Osmonaliev's fun personal project: training nanoGPT on Sayakbai Karalaev's version of Epic of Manas

Methods/Software

spaCy basic support: tokenization, stopwords, like_num
stanza-ky pipeline called 'ktmu'; use with care, its handling of brackets looks very suspicious
kyrgyz-nlp/disambiguator project studies the ability of popular embedding models to select word senses based on the word hints (anchor words)

Morphology

Kyrgyz for Apertium: morphological analysis and generation, PoS-tagging; installation script: install_apertium_kir.sh. A much, much easier way: import apertium; apertium.installer.install_module("kir").
[DEPRECATED] kymopl: Kyrgyz morphology in Prolog

Hate Speech detection

Jupyter Notebook for hate speech detection

Other

Tilchi electronic Russian-Kyrgyz dictionary, open source desktop application
ӨҮҢизатор: demo code for a proof-of-concept letter replacement Telegram bot; it fixes incorrect usage of 'О', 'У', 'Н' in place of 'Ө', 'Ү', 'Ң'
Number-to-words conversion (JavaScript) by @AzamatSooldaev
Number-to-words conversion (TypeScript) by @timursaurus
Telegram bot for Kyrgyz morphological analysis by @sasha-kir based on Apertium data for Kyrgyz
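
The letter-replacement idea behind ӨҮҢизатор can be illustrated with a small dictionary lookup. This is a sketch under stated assumptions and not the bot's actual method: it assumes a user-supplied set of correctly spelled lowercase words (`lexicon`, hypothetical) and swaps 'О', 'У', 'Н' for 'Ө', 'Ү', 'Ң' only when exactly one resulting spelling is in that set.

```python
from itertools import product

# Plain letter -> Kyrgyz-specific letter, as listed in the bot description.
SUBSTITUTES = {"о": "ө", "у": "ү", "н": "ң", "О": "Ө", "У": "Ү", "Н": "Ң"}


def candidates(word):
    """Yield every spelling obtained by optionally swapping each О/У/Н.

    The number of candidates doubles with each replaceable letter, which is
    fine for single words but not for long strings.
    """
    options = [(ch, SUBSTITUTES[ch]) if ch in SUBSTITUTES else (ch,) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def fix_word(word, lexicon):
    """Return the corrected spelling, or the word unchanged if unsure.

    ASSUMPTION: `lexicon` is a set of correctly spelled lowercase words.
    """
    if word.lower() in lexicon:
        return word  # already a known word, leave it alone
    matches = [c for c in candidates(word) if c.lower() in lexicon]
    # Only correct when the choice is unambiguous.
    return matches[0] if len(matches) == 1 else word
```

Leaving ambiguous and unknown words untouched is a deliberate choice here: several Kyrgyz word pairs differ only in these letters, so a blind replacement would introduce new errors.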

Online Demos

Cyrillic-to-Latin online converter based on this resource.

Miscellaneous

Kyrgyz NLP bibliography: kyrgyznlp.github.io
Turkic Interlingua community and SIGTURK (ACL Turkic languages special interest group)
Apertium's useful list of tools and other resources
Online dictionaries and other useful resources on el-sozduk.kg
Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University
Data prepared by CSLT: 128 hours of speech from 163 speakers (100 male, 63 female), transcriptions of the speech audio, and a word-level lexicon; link (requires extra steps, quote: "You should ask for license before you can download the datasets. Please send Email to shiying@cslt.org or lilt@cslt.org to get the license.")

Contributions to this list

@golden-ratio

alexeyev/awesome-kyrgyz-nlp