awesome-ukrainian-nlp

Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)

News

2024/01 — The UNLP 2024 shared task has been announced.

1. Datasets / Corpora

Monolingual

  • Malyuk — 113GB of text, a compilation of UberText 2.0, OSCAR, and Ukrainian News.
  • Brown-UK — carefully curated corpus of the modern Ukrainian language with disambiguated tokens, 1 million words
  • UberText 2.0 — over 5 GB of news, Wikipedia, social, fiction, and legal texts
  • Wikipedia
  • OSCAR — shuffled sentences extracted from Common Crawl and classified with a language detection model. Ukrainian portion of it is 28GB deduplicated.
  • CC-100 — documents extracted from Common Crawl, automatically classified and filtered. Ukrainian part is 200M sentences or 10GB of deduplicated text.
  • mC4 — yet another filtered Common Crawl corpus, 196GB of Ukrainian text.
  • Ukrainian Twitter corpus — tweets collected for toxic text detection.
  • Ukrainian forums — 250k sentences scraped from forums.
  • Ukrainian news headlines — 5.2M news headlines.
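Several of the web corpora above (OSCAR, CC-100, mC4) are distributed deduplicated. As a rough, self-contained illustration of what line-level deduplication means, here is a minimal sketch that collapses near-duplicate lines by hashing a normalized form of each line; real pipelines use far more aggressive near-duplicate detection:

```python
import hashlib

def dedup_lines(lines):
    """Keep the first occurrence of each line, comparing normalized text.

    Normalization here is just whitespace-collapse + lowercasing;
    production pipelines (e.g. for Common Crawl derivatives) go further.
    """
    seen = set()
    out = []
    for line in lines:
        normalized = " ".join(line.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(line)
    return out

corpus = [
    "Київ — столиця України.",
    "київ   — столиця  України.",   # near-duplicate after normalization
    "Львів — місто на заході України.",
]
print(dedup_lines(corpus))  # the second line is dropped
```

Hashing normalized lines keeps memory bounded by the number of unique lines rather than total text size, which matters at the 10–100GB scale of these corpora.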

Parallel

See Helsinki-NLP/UkrainianLT for more data and machine translation resources links.

Labeled

Dictionaries

Prompts

  • Aya — crowd-sourced prompts and reference outputs. Ukrainian part is ~500 prompts.

2. Tools

  • tree_stem — stemmer
  • pymorphy2 + pymorphy2-dicts-uk — POS tagger and lemmatizer
  • LanguageTool — grammar, style and spell checker
  • Stanza — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER
  • nlp-uk — Tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation
  • NLP-Cube — Python package for tokenization, sentence splitting, multi-word tokenization, lemmatization, part-of-speech tagging, and dependency parsing.

3. Pretrained models

Language models

Autoregressive:

  • aya-101 — massively multilingual LM, 13B parameters
  • pythia-uk — Pythia finetuned on wiki and oasst1 for chats in Ukrainian.
  • UAlpaca — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset.
  • XGLM — multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian.
  • Tereveni-AI/GPT-2
  • uk4b and the haloop inference toolkit — GPT-2 small-, medium-, and large-style models trained on the UberText 2.0 Wikipedia, news, and books subsets.
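Checkpoints in this list that are published on the Hugging Face Hub load with the standard transformers API. A minimal generation sketch; the hub id below is an assumption based on the Tereveni-AI entry above, so verify the exact model name on the hub before use:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub id assumed from the Tereveni-AI entry — check the hub for the exact name.
model_id = "Tereveni-AI/gpt2-124m-uk-fiction"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Одного ранку", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Greedy decoding (`do_sample=False`) is used here for reproducibility; for more natural text, sampling with `do_sample=True` and a `temperature` is the usual choice.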

Masked:

Mixed:

Machine translation

See Helsinki-NLP/UkrainianLT for more.

Sequence-to-sequence models

Named-entity recognition (NER)

Part-of-speech tagging (POS)

Word embeddings

Other

4. Paid

5. Other resources and links

6. Workshops and conferences