Kyrgyz language processing software, models and datasets.

Awesome Kyrgyz NLP

A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.

The main focus is on open source tools, downloadable data and research papers with code.

If you want to contribute to this list (please do), send me a pull request. A listed repository should be tagged as deprecated if:

  • The repository's owners explicitly say that "this library is not maintained".
  • It has not been committed to for a long time (2-3 years).

Table of Contents

Datasets

Corpora

  • Manas-UdS: 1.2M words, 84 literary texts, 5 genres: novel, novelette, epic, minor epic, and fairy tale; lemmata, PoS tags, rich per-text metadata.
  • kkWaC: Kyrgyz corpus from the web, 19M words, Jan 2012
  • Kyrgyz in Leipzig Corpora Collection: Community data / Newscrawl (1M sentences) / Wikipedia (300K sentences)
  • TilCorpusu: Kyrgyz corpus, 100M words, news+fiction, made public in July 2023 (just the News part due to legal restrictions)
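
For the sentence-level corpora above, a small reader can be handy. The sketch below assumes the tab-separated layout commonly used in Leipzig Corpora Collection sentence files (one `id<TAB>sentence` pair per line); check the files you actually download, since the layout is an assumption here. The function name `read_sentences` and the sample lines are made up for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_sentences(handle: TextIO) -> Iterator[Tuple[int, str]]:
    """Yield (sentence_id, sentence) pairs from a tab-separated sentences file.

    Assumed layout (not confirmed by this list): one "id<TAB>sentence" per line.
    """
    for line in handle:
        line = line.rstrip("\r\n")
        # Split on the first tab only, so tabs inside a sentence are preserved.
        sent_id, sep, sentence = line.partition("\t")
        if not sep or not sent_id.isdigit():
            continue  # skip blank or malformed lines
        yield int(sent_id), sentence


# Made-up sample lines standing in for a real sentences file.
sample = io.StringIO("1\tСалам дүйнө.\n2\tБишкек шаары.\n")
for sent_id, sentence in read_sentences(sample):
    print(sent_id, sentence)
```

With a real file, pass `open(path, encoding="utf-8")` instead of the `io.StringIO` sample.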

Character recognition

Raw text

  • kloop corpus: 16'826 articles (sqlite3 DB file) + crawler code
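
Since the kloop corpus ships as an sqlite3 DB file, a first step is usually to see which tables it contains. The schema is not described in this list, so the sketch below deliberately assumes nothing about it: it only lists the tables and their row counts using Python's standard `sqlite3` module. The function name `table_row_counts` and the file name in the usage comment are hypothetical.

```python
import sqlite3
from typing import Dict


def table_row_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Return {table_name: row_count} for every table in an SQLite database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Names come from sqlite_master itself; quote them defensively anyway.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Usage (hypothetical file name):
#   conn = sqlite3.connect("kloop.db")
#   print(table_row_counts(conn))
#   conn.close()
```

Once the table and column names are known, the articles can be read with ordinary `SELECT` queries.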

Morphology & Syntax

Named Entity Recognition

Text Classification

Word Similarity Data

Instructions

Machine-readable dictionaries

Pretrained models

  • Polyglot morfessor — pretrained morfessor model, number 6
  • fastText — 300-dimensional fastText vectors provided by the authors: bin, txt.
  • compressed fastText: fasttext-ky-mini, prepared by Liebl Bernhard in 2021.
  • BERT-based NER: bert-base-multilingual-cased fine-tuned on WikiANN for NER on Kyrgyz. The author warns that this model is not usable and is built just as a proof of concept. Will be updated later.
  • Manas-GPT — Janar Osmonaliev's fun personal project: training nanoGPT on Sayakbai Karalaev's version of Epic of Manas
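
The txt variant of the fastText vectors listed above can be read without extra dependencies. The sketch below assumes the usual fastText text layout: a header line with the vocabulary size and the dimension, then one line per word with space-separated values. The function names `load_vectors` and `cosine`, and the toy three-dimensional data, are made up for illustration; the real vectors are 300-dimensional.

```python
import io
import math
from typing import Dict, List, Optional, TextIO


def load_vectors(handle: TextIO, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Parse word vectors in fastText text format into a dict."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocabulary size> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
        if limit is not None and len(vectors) >= limit:
            break  # stop early to keep memory use small
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy data in the assumed layout, standing in for the real file.
toy = io.StringIO("2 3\nкитеп 1.0 0.0 0.0\nмектеп 0.0 1.0 0.0\n")
vecs = load_vectors(toy)
print(cosine(vecs["китеп"], vecs["мектеп"]))
```

For the full file, pass `open(path, encoding="utf-8")` and consider setting `limit`, since the complete vocabulary may not fit comfortably in memory as Python lists.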

Methods/Software

  • spaCy basic support: tokenization, stopwords, like_num
  • stanza-ky pipeline called 'ktmu'; use with care, as its handling of brackets seems very suspicious
  • kyrgyz-nlp/disambiguator: a project that studies the ability of popular embedding models to select word senses based on word hints (anchor words)

Morphology

Hate Speech detection

Other

Online Demos

Miscellaneous

  • Kyrgyz NLP bibliography: kyrgyznlp.github.io
  • Turkic Interlingua community and SIGTURK (ACL Turkic languages special interest group)
  • Apertium's useful list of tools and other resources
  • Online dictionaries and other useful resources on el-sozduk.kg
  • Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University
  • Data prepared by CSLT: 128 hours of speech from 163 speakers (100 male, 63 female), transcriptions of the speech audio, and a word-level lexicon; link (requires extra steps; quote: "You should ask for license before you can download the datasets. Please send Email to shiying@cslt.org or lilt@cslt.org to get the license.")

Contributions to this list

@golden-ratio