A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, and datasets.
- PolBert - Polish BERT model, trained with the code provided in Google BERT's GitHub repository. Merged with huggingface/Transformers.
- Polish RoBERTa model - trained on a corpus consisting of a Polish Wikipedia dump, Polish books and articles, and the Polish Parliamentary Corpus.
- Allegro BERT - not published yet (as of 12.2019), but there is a poster: https://conference.mlinpl.org/pdf/CfC_AllPosters.pdf
- SlavicBert - multilingual BERT model (BERT, Slavic Cased): 4 languages (Bulgarian, Czech, Polish, Russian), 12-layer, 768-hidden, 12-heads, 110M parameters, 600 MB. There is also another SlavicBert model, http://docs.deeppavlov.ai/en/master/features/models/bert.html, but I have had problems converting it to PyTorch.
- ELMo embeddings - a model of ELMo embeddings for the Polish language, trained on large textual corpora (KGR10).
- Zalando Flair Polish models - contextual string embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. There are two models: "pl-forward" and "pl-backward".
- Wrocław University of Science and Technology Word2Vec - distributional language models for Polish trained on different corpora (KGR10, NKJP, Wikipedia).
- FastText Polish model (Facebook) - trained on Common Crawl and Wikipedia.
- Universal Sentence Encoder Multilingual - sentence embeddings covering 16 languages (including Polish).
- BPEmb: Subword Embeddings - includes Polish; easy to use with Flair.
- Morfologik (Java) and pyMorfologik (Python wrapper) - dictionary-based morphological analyzer
- Morfeusz - morphological analyzer. See also Elasticsearch plugin
- Stempel (Python port) - algorithmic stemmer. See also Elasticsearch plugin
- spaCy for Polish - extends spaCy, a popular production-ready NLP library, to fully support the Polish language.
- KRNNT Polish morphological tagger - a morphological tagger for Polish based on recurrent neural networks. (Paper)
- Stanza (Python) - NLP analysis package from Stanford University
- A curated list of Polish abbreviations for the NLTK sentence tokenizer, based on Wikipedia text
- GitHub repo with a list of Polish word embeddings and language models (Word2vec, fastText, GloVe, ELMo) - https://github.com/sdadas/polish-nlp-resources
- Polish Word Embeddings Review - evaluation of Polish word embeddings (word2vec, fastText, etc.) prepared by various research groups. Evaluation is done on the word analogy task.
- Polish Sentence Evaluation - contains an evaluation of eight sentence representation methods (Word2Vec, GloVe, FastText, ELMo, Flair, BERT, LASER, USE) on five Polish linguistic tasks
- TRAINING ROBERTA FROM SCRATCH - THE MISSING GUIDE - a complete guide to training a RoBERTa model for Polish with Huggingface/Transformers
- The KLEJ (Kompleksowa Lista Ewaluacji Językowych) benchmark is a set of nine evaluation tasks for Polish language understanding.
- PolEval datasets
  - Hate speech classification - distinguish between normal/non-harmful tweets (class: 0) and tweets that contain any kind of harmful information (class: 1) [PolEval 2019 Task6] [mirror GDrive]
- Polish CDSCorpus - a dataset for compositional distributional semantics. It consists of 10K Polish sentence pairs that are human-annotated for semantic relatedness and entailment.
- Wroclaw Corpus of Consumer Reviews Sentiment (WCCRS) - corpus of Polish reviews annotated with sentiment at the level of the whole text (text) and at the level of sentences (sentence) for the following domains: hotels, medicine, products and university (reviews*)
- Ermlab Opineo dataset - Opineo reviews - GDrive
- HateSpeech corpus - contains over 2000 posts crawled from the public Polish web: http://zil.ipipan.waw.pl/HateSpeech
- Polish analogy dataset - example: "Ateny Grecja Bagdad Irak" - useful for word embeddings evaluation
- NKJP - National Corpus of Polish. It contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. Only a small sub-corpus is available for download (GNU GPL v.3). Direct contact may be necessary to get the full corpus.
- PolEmo 2.0 Sentiment Analysis Dataset for CoNLL
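
The analogy dataset and the word embeddings review above both rely on the word analogy task. As a rough illustration of how one line such as "Ateny Grecja Bagdad Irak" can be scored, here is a minimal sketch in plain Python. The 2-D vectors, the second question, and the function names `solve_analogy` and `analogy_accuracy` are made up for this example; a real evaluation would load trained embeddings (word2vec, fastText, etc.) and would likely use a library routine rather than this hand-rolled loop.

```python
import math

# Toy 2-D vectors, invented purely for illustration.
# Real evaluations would load trained Polish embeddings instead.
TOY_VECTORS = {
    "Ateny": [1.0, 0.0],
    "Grecja": [1.0, 1.0],
    "Bagdad": [3.0, 0.0],
    "Irak": [3.0, 1.0],
    "Warszawa": [5.0, 0.0],
    "Polska": [5.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using the vector offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidates,
    # as is common practice for this task.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(lines, vectors):
    """Fraction of 'a b c d' lines where the predicted word equals d."""
    correct = 0
    for line in lines:
        a, b, c, expected = line.split()
        if solve_analogy(a, b, c, vectors) == expected:
            correct += 1
    return correct / len(lines)


if __name__ == "__main__":
    questions = [
        "Ateny Grecja Bagdad Irak",      # example line from the dataset entry
        "Ateny Grecja Warszawa Polska",  # hypothetical extra line
    ]
    print(solve_analogy("Ateny", "Grecja", "Bagdad", TOY_VECTORS))
    print(analogy_accuracy(questions, TOY_VECTORS))
```

The reported accuracy is simply the share of lines whose fourth word is recovered, which is the number usually compared across embedding models.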
- OSCAR (Open Super-large Crawled ALMAnaCH coRpus) - a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus. Contains 109GB or 49GB of Polish text.
- Polish Wikipedia dump - a regular monthly copy of the Polish Wikipedia. More than 4GB of text.
- Opus - the open parallel corpus - you can select languages and download only the Polish file
  - Polish OpenSubtitles v2018 - 45.9M sentences, 287.1M Polish tokens; a collection of translated movie subtitles from opensubtitles: raw txt corpus (unpacked 7.2GB), tokenized txt corpus (unpacked 7.6GB).
  - ParaCrawl v5 - 6.4M sentences, 157.1M Polish tokens: raw txt corpus (unpacked 1.1GB), tokenized txt corpus
- Polish Parliamentary Corpus - text from the proceedings of the Polish Parliament (Sejm and Senate)
- "Evaluation of Sentence Representations in Polish" - Sławomir Dadas, Michał Perełkiewicz, Rafał Poświata, 2019 https://arxiv.org/pdf/1910.11834.pdf
- "Multi-level analysis and recognition of the text sentiment on the example of consumer opinions" - Kocoń Jan, Zaśko-Zielińska Monika, Miłkowski Piotr, 2019