corus

Links to Russian corpora + Python functions for loading and parsing

License: MIT

Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350Gb+ of text.

Usage

For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (the link is in the Reference section):

wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz

Use corus to load the data:

>>> from corus import load_lenta

>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)

LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)

Iterate over texts:

>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
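
Because the loader is consumed with `next(records)` and a `for` loop, records can be aggregated lazily without holding the whole dump in memory. Below is a minimal sketch, assuming each record exposes a `topic` attribute as in the `LentaRecord` shown above. `FakeRecord` and `count_topics` are hypothetical names used only so the example runs without the downloaded archive:

```python
from collections import Counter, namedtuple
from itertools import islice

# Hypothetical stand-in for corus records; real ones come from load_lenta(path).
FakeRecord = namedtuple('FakeRecord', ['title', 'text', 'topic'])


def count_topics(records, limit=None):
    """Count records per topic, reading at most `limit` records lazily."""
    return Counter(record.topic for record in islice(records, limit))


records = iter([
    FakeRecord('a', 'text a', 'Россия'),
    FakeRecord('b', 'text b', 'Мир'),
    FakeRecord('c', 'text c', 'Россия'),
])
topics = count_topics(records, limit=2)
print(topics.most_common())
```

With the real data the call would look like `count_topics(load_lenta(path), limit=10000)`; the `limit` argument is handy for a quick look at a multi-gigabyte file.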

For links to other datasets and their loaders see the Reference section.

Documentation

Materials are in Russian:

Install

corus supports Python 3.5+ and PyPy 3.

$ pip install corus

Reference

Dataset | API (from corus import <loader>) | Tags | Texts | Uncompressed size | Description
Lenta.ru
Lenta.ru v1.0 load_lenta # news 739 351 1.66 Gb wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
Lenta.ru v1.1+ load_lenta2 # news 800 975 1.94 Gb wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2
Lib.rus.ec load_librusec # fiction 301 871 144.92 Gb Dump of lib.rus.ec prepared for RUSSE workshop

wget http://panchenko.me/data/russe/librusec_fb2.plain.gz
Rossiya Segodnya load_ria_raw # load_ria # news 1 003 869 3.70 Gb wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz
Mokoron Russian Twitter Corpus load_mokoron # social sentiment 17 633 417 1.86 Gb Russian Twitter sentiment markup

Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql
Wikipedia load_wiki # 1 541 401 12.94 Gb Russian Wiki dump

wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2
GramEval2020 load_gramru # 162 372 30.04 Mb wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip
unzip master.zip
mv GramEval2020-master/dataTrain train
mv GramEval2020-master/dataOpenTest dev
rm -r master.zip GramEval2020-master
wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu
OpenCorpora load_corpora # morph 4 030 20.21 Mb wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip
RusVectores SimLex-965 load_simlex # emb sim wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv
wget https://rusvectores.org/static/testsets/ru_simlex965.tsv
Omnia Russica load_omnia # morph web fiction 489.62 Gb Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf

Manually download http://bit.ly/2ZT4BY9
factRuEval-2016 load_factru # ner news 254 969.27 Kb Manual PER, LOC, ORG markup prepared for 2016 Dialog competition

wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip
unzip master.zip
rm master.zip
Gareev load_gareev # ner news 97 455.02 Kb Manual PER, ORG markup (no LOC)

Email Rinat Gareev (gareev-rm@yandex.ru) and ask for the dataset
tar -xvf rus-ner-news-corpus.iob.tar.gz
rm rus-ner-news-corpus.iob.tar.gz
Collection5 load_ne5 # ner news 1 000 2.96 Mb News articles with manual PER, LOC, ORG markup

wget http://www.labinform.ru/pub/named_entities/collection5.zip
unzip collection5.zip
rm collection5.zip
WiNER load_wikiner # ner 203 287 36.15 Mb Sentences from Wiki auto annotated with PER, LOC, ORG tags

wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2
BSNLP-2019 load_bsnlp # ner 464 1.16 Mb Markup prepared for 2019 BSNLP Shared Task

wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip
wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip
unzip TRAININGDATA_BSNLP_2019_shared_task.zip
unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg
rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip
Persons-1000 load_persons # ner news 1 000 2.96 Mb Same as Collection5, only PER markup + normalized names

wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip
The Russian Drug Reaction Corpus (RuDReC) load_rudrec # ner 4 809 1.73 Kb RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part. To get the raw part (1.4M reviews), please refer to https://github.com/cimm-kzn/RuDReC.

wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json
Taiga Large collection of Russian texts from various sources: news sites, magazines, literature, social networks

wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz
tar -xzvf retagged_taiga.tar.gz
Arzamas load_taiga_arzamas # news 311 4.50 Mb
Fontanka load_taiga_fontanka # news 342 683 786.23 Mb
Interfax load_taiga_interfax # news 46 429 77.55 Mb
KP load_taiga_kp # news 45 503 61.79 Mb
Lenta load_taiga_lenta # news 36 446 95.15 Mb
Taiga/N+1 load_taiga_nplus1 # news 7 696 24.96 Mb
Magazines load_taiga_magazines # 39 890 2.19 Gb
Subtitles load_taiga_subtitles # 19 011 909.08 Mb
Social load_taiga_social # social 1 876 442 648.18 Mb
Proza load_taiga_proza # fiction 1 732 434 38.25 Gb
Stihi load_taiga_stihi # 9 157 686 12.80 Gb
Russian NLP Datasets Several Russian news datasets from webhose.io, lenta.ru and other news sites.
News load_buriy_news # news 2 154 801 6.84 Gb Dump of top 40 news + 20 fashion news sites.

wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2
wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2
wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2
Webhose load_buriy_webhose # news 285 965 859.32 Mb Dump from webhose.io, 300 sources for one month.

wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2
ODS #proj_news_viz Several news sites scraped by members of #proj_news_viz ODS project.
Interfax load_ods_interfax # news 543 961 1.22 Gb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz
Gazeta load_ods_gazeta # news 865 847 1.63 Gb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz
Izvestia load_ods_izvestia # news 86 601 307.19 Mb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz
Meduza load_ods_meduza # news 71 806 270.11 Mb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz
RIA load_ods_ria # news 101 543 233.88 Mb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz
Russia Today load_ods_rt # news 106 644 187.12 Mb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz
TASS load_ods_tass # news 1 135 635 3.27 Gb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz
Universal Dependencies
GSD load_ud_gsd # morph syntax 5 030 1.01 Mb wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu
wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu
wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu
Taiga load_ud_taiga # morph syntax 3 264 353.80 Kb wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu
wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu
wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu
PUD load_ud_pud # morph syntax 1 000 207.78 Kb wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu
SynTagRus load_ud_syntag # morph syntax 61 889 11.33 Mb wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu
wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu
wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu
morphoRuEval-2017
General Internet-Corpus load_morphoru_gicrya # morph 83 148 10.58 Mb wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip
unzip GIKRYA_texts_new.zip
rm GIKRYA_texts_new.zip
Russian National Corpus load_morphoru_rnc # morph 98 892 12.71 Mb wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar
unrar x RNC_texts.rar
rm RNC_texts.rar
OpenCorpora load_morphoru_corpora # morph 38 510 4.80 Mb wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar
unrar x OpenCorpora_Texts.rar
rm OpenCorpora_Texts.rar
RUSSE Russian Semantic Relatedness
HJ: Human Judgements of Word Pairs load_russe_hj # emb sim wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv
RT: Synonyms and Hypernyms from the Thesaurus RuThes load_russe_rt # emb sim wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv
AE: Cognitive Associations from the Sociation.org Experiment load_russe_ae # emb sim wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv
wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv
wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv
Toloka Datasets
Lexical Relations from the Wisdom of the Crowd (LRWC) load_toloka_lrwc # emb sim wget https://tlk.s3.yandex.net/dataset/LRWC.zip
unzip LRWC.zip
rm LRWC.zip
The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT) load_ruadrect # social 9 515 2.09 Mb This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020

wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip
unzip RuADReCT.zip
rm RuADReCT.zip
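
Many of the archives above are gzip or bzip2 files, and some unpack to tens or hundreds of gigabytes, so it is usually better to stream them than to decompress them to disk. Below is a minimal sketch using only the standard library. `open_text` and `count_lines` are hypothetical helpers, not part of corus, and this is not a description of how corus reads files internally:

```python
import bz2
import gzip
import os
import tempfile


def open_text(path, encoding='utf8'):
    """Open a plain, .gz or .bz2 file as a text stream, chosen by extension."""
    if path.endswith('.gz'):
        return gzip.open(path, mode='rt', encoding=encoding)
    if path.endswith('.bz2'):
        return bz2.open(path, mode='rt', encoding=encoding)
    return open(path, encoding=encoding)


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    with open_text(path) as file:
        return sum(1 for _ in file)


# Demo on a tiny temporary archive instead of a real multi-gigabyte dump.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv.gz')
    with gzip.open(demo_path, mode='wt', encoding='utf8') as file:
        file.write('url,title\nhttps://example.org,Пример\n')
    line_count = count_lines(demo_path)
print(line_count)
```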

Support

Add new source

  1. Implement corus/sources/<source>.py
  2. Add import into corus/sources/__init__.py
  3. Add meta into corus/sources/meta.py
  4. Add example into docs.ipynb (check meta table is correct)
  5. Run tests (readme is updated)
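
As an illustration of step 1, a source module typically defines a record type and a `load_*` generator. The sketch below is self-contained and hypothetical: the `ExampleRecord` fields, the `load_example` name and the gzip-compressed CSV layout are all assumptions. A real source should follow the conventions of the existing modules in corus/sources/ rather than this standalone version:

```python
import csv
import gzip
import os
import tempfile
from collections import namedtuple

# Hypothetical record type; field names are assumptions for this sketch.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    """Lazily yield one ExampleRecord per row of a gzip-compressed CSV."""
    with gzip.open(path, mode='rt', encoding='utf8', newline='') as file:
        rows = csv.reader(file)
        next(rows, None)  # skip the header row (assumed to be present)
        for row in rows:
            yield ExampleRecord(*row)


# Demo: write a tiny archive and read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'example.csv.gz')
    with gzip.open(demo_path, mode='wt', encoding='utf8', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['url', 'title', 'text'])
        writer.writerow(['https://example.org/1', 'Заголовок', 'Текст'])
    loaded = list(load_example(demo_path))
print(loaded[0].title)
```

Writing the loader as a generator keeps it consistent with the usage shown earlier, where records are pulled one at a time with `next(records)` or a `for` loop.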

Development

Dev env

python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate

pip install -r requirements/dev.txt
pip install -e .

python -m ipykernel install --user --name natasha-corus

Lint + update docs

make lint
make exec-docs

Release

# Update setup.py version

git commit -am 'Up version'
git tag v0.10.0

git push
git push --tags
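
Since the version in setup.py and the git tag are updated by hand, a small check before tagging can catch a mismatch. Below is a minimal sketch, assuming setup.py contains a line of the form `version='0.10.0'`. `read_version` and `tag_matches` are hypothetical helpers, not part of the repository:

```python
import re


def read_version(setup_source):
    """Extract the version string from the text of a setup.py file."""
    match = re.search(r"version\s*=\s*['\"]([^'\"]+)['\"]", setup_source)
    if not match:
        raise ValueError('version not found')
    return match.group(1)


def tag_matches(setup_source, tag):
    """Return True if a tag such as 'v0.10.0' matches the setup.py version."""
    return tag == 'v' + read_version(setup_source)


# Inline sample instead of reading the real setup.py from disk.
sample = "setup(\n    name='corus',\n    version='0.10.0',\n)"
print(tag_matches(sample, 'v0.10.0'))
```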