awesome-ukrainian-nlp

Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)

News

2024/01 -- The UNLP 2024 shared task has been announced

1. Datasets / Corpora

Monolingual

  • Brown-UK — carefully curated corpus of the modern Ukrainian language with disambiguated tokens, 1 million words
  • UberText 2.0 — over 5 GB of news, Wikipedia, social, fiction, and legal texts
  • Wikipedia
  • OSCAR — shuffled sentences extracted from Common Crawl and classified with a language detection model. The Ukrainian portion is 28 GB after deduplication.
  • CC-100 — documents extracted from Common Crawl, automatically classified and filtered. The Ukrainian part is 200M sentences, or 10 GB of deduplicated text.
  • mC4 — another filtered Common Crawl derivative, 196 GB of Ukrainian text.
  • Ukrainian Twitter corpus - a corpus for toxic text detection.
  • Ukrainian forums — 250k sentences scraped from forums.
  • Ukrainian news headlines — 5.2M news headlines.

Parallel

See Helsinki-NLP/UkrainianLT for more data and links to machine translation resources.

Labeled

Dictionaries

2. Tools

  • tree_stem — stemmer
  • pymorphy2 + pymorphy2-dicts-uk — POS tagger and lemmatizer
  • LanguageTool — grammar, style and spell checker
  • Stanza — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER
  • nlp-uk — Tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation

3. Pretrained models

Language models

Masked:

Autoregressive:

  • pythia-uk — mT5 fine-tuned on wiki and oasst1 for chats in Ukrainian.
  • UAlpaca — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset.
  • XGLM — multilingual autoregressive LM; the 4.5B checkpoint includes Ukrainian.
  • Tereveni-AI/GPT-2

Mixed:

Machine translation

See Helsinki-NLP/UkrainianLT for more.

Sequence-to-sequence models

Named-entity recognition (NER)

Part-of-speech tagging (POS)

Word embeddings

Other

4. Paid

5. Other resources and links

6. Workshops and conferences