A list of Romanian NLP Datasets

A curated list of open source and open access Romanian language NLP datasets. For the moment we don't add parallel corpora to the list.

For additions or any other changes please submit a pull request.

Unlabeled text Corpora
Semantic Textual Similarity / Paraphrasing
Natural Language Inference
Summarization
Dialect and regional speech identification
Named Entity Recognition (NER)
Authorship Attribution
Sentiment Analysis
Dependency Parsing
Diacritics Restoration / Grammar Correction
Fake News / Clickbait / Satirical News
Offensive Language
Questions and Answers
Spelling, Dictionaries and Grammatical Errors

Unlabeled text Corpora

❄️FuLG dataset ❄️

The FuLG dataset is a comprehensive Romanian language corpus comprising
150 billion tokens, carefully extracted from Common Crawl.

🌐 Oscar Common Crawl dataset 🌐

Part of a large multilingual corpus derived from Common Crawl.
It is a raw, unannotated corpus with roughly 50 GB of Romanian text
in 4.5 million documents. For details, check its homepage
and the paper.

📚 CC-100 📚

Similar to Oscar, part of a multilingual corpus also based on Common Crawl
from 2018. The Romanian portion is 16 GB.

🌍 Wikipedia Corpus 🌍

Romanian-language Wikipedia dump.

📰⚖️ RoTex Collection 📰⚖️

A collection of various unannotated corpora collected around 2018-2019.
Includes books, scraped newspapers and juridical documents.

📖 Romanian Language Repository 📖

  A collection of written and spoken text from various
  sources: Articles, Fairy tales, Fiction, History, Theatre, News

🏛️ MARCELL Legislative Corpus 🏛️

Romanian national legislation from 1881 to 2021. The corpus
includes mainly governmental decisions, ministerial orders,
decisions, decrees and laws.
Automatically annotated for named entities.

🦠 COVID-19 Tweets 🐦

Mega-COV is a billion-scale dataset from Twitter for studying COVID-19. It covers more than 100 languages, Romanian being one of them. Tweets need to be rehydrated.
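Rehydration means fetching the full tweet objects for the tweet IDs that are distributed with the dataset. Below is a minimal sketch of the batching step only. It assumes IDs are looked up in groups of at most 100, and that `lookup_batch` is a hypothetical, user-supplied function wrapping whatever API client is used; neither detail comes from the dataset description.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def batched(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` IDs."""
    batch: List[str] = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def rehydrate(
    ids: Iterable[str],
    lookup_batch: Callable[[List[str]], Dict[str, dict]],
    batch_size: int = 100,  # assumption: per-request ID limit of the API in use
) -> Dict[str, dict]:
    """Collect tweet objects for all IDs; tweets that are gone are skipped."""
    tweets: Dict[str, dict] = {}
    for batch in batched(ids, batch_size):
        # lookup_batch is hypothetical: it returns {id: tweet} for the
        # IDs in the batch that are still available.
        tweets.update(lookup_batch(batch))
    return tweets


# Stand-in lookup for illustration: pretends only even IDs still exist.
def fake_lookup(batch: List[str]) -> Dict[str, dict]:
    return {i: {"id": i, "text": "..."} for i in batch if int(i) % 2 == 0}


ids = [str(n) for n in range(250)]
tweets = rehydrate(ids, fake_lookup)
```

Deleted or protected tweets cannot be recovered this way, so a rehydrated copy is usually smaller than the published ID list.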

COVIDSentiRO

A corpus of Romanian tweets related to COVID and vaccination against COVID, created and collected between January 2021 and February 2022. It contains 19319 tweets.

📜 Minutes of the Sittings of the Chamber of Deputies of Romania 📜

Minutes of the Sittings of the Chamber of Deputies of Romania (2016-2018)
Unannotated corpus

🔊 Minutes of the Sittings of the Romanian Parliament 🔊

Contains 500k+ instances of speech from the parliament podium from
1996 to 2018. Sentence splitting and deduplication at sentence level
have been applied as processing steps.
Unannotated corpus
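The two processing steps mentioned above, sentence splitting and sentence-level deduplication, can be sketched roughly as follows. This is an illustration only: the regex-based splitter and the case and whitespace normalization used to compare sentences are assumptions, not the corpus authors' actual pipeline.

```python
import re

# Naive splitter: break on whitespace that follows ., ! or ?.
# Assumption: a real pipeline would use a proper Romanian sentence tokenizer.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences, dropping empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def dedup_sentences(speeches):
    """Return unique sentences across all speeches, in first-seen order."""
    seen = set()
    unique = []
    for speech in speeches:
        for sentence in split_sentences(speech):
            # Assumption: duplicates are compared case-insensitively,
            # with runs of whitespace collapsed.
            key = " ".join(sentence.lower().split())
            if key not in seen:
                seen.add(key)
                unique.append(sentence)
    return unique


speeches = [
    "Stimați colegi. Declar ședința deschisă.",
    "Vă mulțumesc! Stimați colegi.",
]
result = dedup_sentences(speeches)
```

Here the repeated "Stimați colegi." from the second speech is dropped, and the remaining sentences keep their original order.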

🗣️ Romanian Presidential Discourses 🗣️

Romanian presidential discourses (1990-2020), split into 4 files,
one for each president. Unannotated corpus

🎭 Culture Domain Corpus 🎭

Monolingual Romanian corpus, including content from public websites related to culture

Law Domain Corpus

Monolingual (ron) corpus, containing 38063991 tokens and 854096 lexical types in the law domain.

Public Administration Domain Corpus

Monolingual Romanian corpus, containing 360833 sentences (9064764 words) in the public administration domain.

New Civil Procedure Code

The New Civil Procedure Code in Romanian (monolingual) comprising 297888 words.

New Criminal Code

The updated Romanian criminal code: the text of the law.

Romanian News Articles Dataset

A dataset of news articles from Romanian news sites, with title, summary and article text.

Old Newspapers

A multilingual corpus built from news sources available online. It also contains 43 million words in Romanian from Twitter, blogs and newspapers.

ELTeC-Rom

The Romanian novel collection for ELTeC, the European Literary Text Collection. Sources: Biblioteca Metropolitana din Bucuresti, Biblioteca Universitara "Mihai Eminescu" din Iasi, Biblioteca Judeteana din Botosani, and personal micro-collections uploaded on Zenodo under the following labels: "Hajduks Library"; "RomanianNovel Library"; "CityMysteries Library"; "BibliotecaDHL_Iasi"

RO Business Emails

Public dataset of 1447 manually annotated Romanian business-oriented emails. The corpus is annotated with 5 token-related labels, as well as 5 sequence-related classes

📖RO-Stories📖

The corpus consists of texts written by Romanian authors between the 19th century and the present, representing stories, short stories, fairy tales and sketches. The current version contains 19 authors, 1263 full texts and 12516 paragraphs of around 200 words each, preserving paragraph integrity.
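One way to obtain units of roughly 200 words without cutting through a paragraph is to pack whole paragraphs greedily. Below is a minimal sketch, assuming paragraphs are separated by blank lines and that a chunk is closed once the next paragraph would push it past the target. The function name and the packing rule are illustrative assumptions, not the dataset authors' documented method.

```python
def chunk_paragraphs(text, target_words=200):
    """Greedily pack whole paragraphs into chunks of about target_words words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if this paragraph would exceed the target.
        if current and current_len + n > target_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        # A paragraph is never split, even if it is longer than the target.
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Three paragraphs of 120, 90 and 50 words.
sample = "\n\n".join(["cuvânt " * 120, "cuvânt " * 90, "cuvânt " * 50])
chunks = chunk_paragraphs(sample)
```

With this rule the sample yields two chunks of 120 and 140 words: the first paragraph stays alone because adding the second would exceed 200 words.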

📕ROST📕

A dataset containing 400 Romanian texts written by 10 authors. The dataset contains stories, short stories, fairy tales, novels, articles, and sketches written by Ion Creangă, Barbu Ştefănescu Delavrancea, Mihai Eminescu, Nicolae Filimon, Emil Gârleanu, Petre Ispirescu, Mihai Oltean, Emilia Plugaru, Liviu Rebreanu, Ioan Slavici.

🍳Romanian Cooking Recipes🍳

891 Cooking Recipes in Romanian Language

Semantic Textual Similarity / Paraphrasing

RO-STS

Semantic Textual Similarity dataset for the Romanian language. RO-STS contains 8,628 sentence pairs with their similarity scores.

Romanian Bible Paraphrase Corpus

A paraphrase corpus created from 10 different Romanian language Bible versions. The final dataset contains 904,815 similar records and 218,977 non-matching records, totaling 1,123,927.

Romanian paraphrase dataset

Around 100k examples of paraphrases. No clear explanation of how the dataset was built.

TaPaCo

A multilingual paraphrase corpus for 73 languages extracted from the Tatoeba database. It has ~2000 Romanian phrases totaling 941 paraphrase groups.

Natural Language Inference

RONLI

The first Romanian NLI corpus (RoNLI), comprising 58K training sentence pairs, which are obtained via distant supervision, and 6K validation and test sentence pairs, which are manually annotated with the correct labels.

~~RO-NLI~~

The repository seems to be just an attempt at starting to build the dataset

Summarization

RO Text Summarization

Around 72k full texts and their summaries. The source seems to be news websites. No description or explanation available.

Dialect and regional speech identification

RoDia

A varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments. Around 2800 records labeled with age, gender and type of dialect.

MOROCO

MOROCO: The Moldavian and Romanian Dialectal Corpus. The MOROCO data set contains Moldavian and Romanian samples of text collected from the news domain. The samples belong to one of the following six topics: culture, finance, politics, science, sports, tech, totaling over 32,000 labeled records.

Named Entity Recognition (NER)

Authorship Attribution

ROST

Sentiment Analysis

Dependency Parsing

Diacritics Restoration / Grammar Correction

Fake News / Clickbait / Satirical News

Offensive Language

4,052 comments from a Romanian local news website, manually annotated into one of the following classes: non-offensive, targeted insults, racist, homophobic, and sexist.

FB RO-Offense

4455 organically generated comments from Facebook live broadcasts, annotated for both binary and fine-grained offensive language detection.

RO-Offense-Sequences

4800 Romanian comments annotated with offensive text spans, intended for offensive span detection.

Hate Speech RO

3860 labeled hate speech records

ROFF

The dataset consists of 5000 tweets, of which 924 were labeled as offensive (18.48%) and 4076 as non-offensive.

CoRoSeOf

The corpus contains 39 245 tweets, annotated by multiple annotators, following the sexist label set of a recent study.

Questions and Answers

🧮 GSM8K RO 🧮

This dataset is just the translation of the gsm8k dataset. GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. There is no information on the quality of the translation

💻 ROCODE 💻

RoCode, a competitive programming dataset, consisting of 2,642 problems written in Romanian, 11k solutions in C, C++ and Python and comprehensive testing suites for each problem. The purpose of RoCode is to provide a benchmark for evaluating the code intelligence of language models trained on Romanian / multilingual text as well as a fine-tuning set for pretrained Romanian models.

Spelling, Dictionaries and Grammatical Errors

Grammar-RO

Synthetic dataset with ~1.9M records. The altered and the correct statements are provided as columns.

RoAcReL

Romanian Archaisms Regionalisms Lexicon containing ~ 1940 Word definitions

RoRuDi

Romanian Rules for Dialects - 1940 regionalisms, meanings and the region of provenience

AndyTheFactory/romanian-nlp-datasets
