awesome-ukrainian-nlp

Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libriaries, etc.)

News

2024/01 -- UNLP 2024 shared task has been announced

1. Datasets / Corpora

Monolingual

Malyuk — 113GB of text, compilation of UberText 2.0, OSCAR, Ukrainian News.
Brown-UK — carefully curated corpus of modern Ukrainian language with dismabiguated tokens, 1 million words
UberText 2.0 — over 5 GB of news, Wikipedia, social, fiction, and legal texts
Wikipedia
OSCAR — shuffled sentences extracted from Common Crawl and classified with a language detection model. Ukrainian portion of it is 28GB deduplicated.
CC-100 — documents extracted from Common Crawl, automatically classified and filtered. Ukrainian part is 200M sentences or 10GB of deduplicated text.
mC4 — filtered CommonCrawl again, 196GB of Ukrainian text.
Ukrainian Twitter corpus - Ukrainian Twitter corpus for toxic text detection.
Ukrainian forums — 250k sentences scraped from forums.
Ukrainain news headlines — 5.2M news headlines.

Parallel

OPUS
Tatoeba MT Challenge data sets
Polish-Ukrainian Parallel Corpus
Back-translated monolingual Wiki data
Wiki Edits — 5M sentence edits extracted from the Ukrainian Wikipedia revision history.

See Helsinki-NLP/UkrainianLT for more data and machine translation resources links.

Labeled

ZNO — ~4000 questions and answers from Ukrainian External independent testing (ЗНО/ZNO).
UA-GEC — grammatical error correction (GEC) and fluency corpus.
NER-uk — Brown-UK labeled for named entities.
Yakaboo Book Reviews — book reviews, ratings and descriptions.
Universal Dependencies — dependency trees corpus.
ua-news — 150k news article in 5 categories.
UA-SQuAD — Ukrainian version of Stanford Question Answering Dataset.
Ukrainian Winograd schema challenge (WSC) Dataset — manually translated.
Ukrainian OntoNotes Dataset — scripts to build large silver dataset for coreference resolution.

Dictionaries

ВЕСУМ — POS tag dictionary. Can generate a list of all word forms valid for spelling.
Tonal dictionary
Multilingualsentiment, includes Ukrainian - a list of positive/negative words
obscene-ukr — profanity dictionary
Word stress dictionary — word stress for 2.7M word forms. See ukrainian-word-stress
Heteronyms — words that share the same spelling but have different meaning/pronunciation.
Abbreviations — map abbreviation to expansion

Prompts

Aya — crowd-sourced prompts and reference outputs. Ukrainian part is ~500 prompts.

2. Tools

tree_stem — stemmer
pymorphy2 + pymorphy2-dicts-uk — POS tagger and lemmatizer
LanguageTool — grammar, style and spell checker
Stanza — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER
nlp-uk — Tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation
NLP-Cube - Python package for tokenization, sentence splitting, multi-word-tokenization, lemmatization, part-of-speech tagging and dependency parsing.

3. Pretrained models

Language models

Autoregressive:

aya-101 — massively multilingual LM, 13B parameters
pythia-uk — mT5 finetuned on wiki and oasst1 for chats in Ukrainian.
UAlpaca — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset.
XGLM — multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian.
Tereveni-AI/GPT-2
uk4b and haloop inference toolkit - GPT-2 small, medium and large-style models trained on UberText 2.0 wikipedia, news and books.

Masked:

xlm-roberta-base-uk — truncated version of XLM-RoBERTa with only Ukrainian and English embeddings left.
youscan/ukr-roberta-base

Mixed:

Electra

Machine translation

Helsinki-NLP / OPUS-MT models — Ukrainian to/from 25 langaguages.
- OPUS-MT models at HuggingFace
- OPUS-MT models evaluated on flores101
M2M-100 — Ukrainian to/from 100 languages.

See Helsinki-NLP/ UkrainianLT for more.

Sequence-to-sequence models

mBART50
mT5

Named-entity recognition (NER)

Part-of-speech tagging (POS)

lang-uk/flair-uk-pos

Word embeddings

fastText
- Official fastText trained on CommonCrawl and Wiki — 157 languages, including Ukrainian.
- Older official fastText trained on Wiki — 294 languages, including Ukrainian.
- fastText_multilingual — 78 languages, aligned to the same vector space.
- fasttext_uk (2023) and cbow — trained on UberText 2.0
Word2Vec
GloVe
LexVec
BPEmb: Subword Embeddings, includes Ukrainian - easy to use with Flair
Flair — Ukrainian added in 2022.

Other

uk-punctcase — punctuation and case restoration model based on XLM-RoBERTa-Uk.
punctuation_uk_bert — another punctation and case restoration model based on bert-base-multilingual-cased.
ukrainian-word-stress — adds word stress.

4. Paid

LORELEI Ukrainian Representative Language Pack - Ukrainian monolingual text, Ukrainian-English parallel text, partially annotated for named entities

5. Other resources and links

Helsinki-NLP/ UkrainianLT — another collection of links to Ukrainian language tools.
egorsmkv / speech-recognition-uk — speech recognition and text-to-speech models and datasets

6. Workshops and conferences

Ukrainian Natural Language Processing Workshop
UNLP 2023 shared task — shared task (competition) in grammatical error correction for Ukrainian
- Training data and evaluation scripts
- Public leaderboard
UNLP 2024 shared task — shared task (competition) on fine-tuning large language models (LLMs) for Ukrainian

osyvokon/awesome-ukrainian-nlp