corus: A Jupyter Notebook repository from iashchak

Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350Gb+ of text.

Usage

For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (the link is in the Reference section):

wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz

Use corus to load the data:

>>> from corus import load_lenta

>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)

LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)

Iterate over texts:

>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
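
Because `next(records)` works on the loader's result, the loader returns an iterator: it is consumed once, which is why `load_lenta(path)` is called again before the loop. The sketch below shows one way to process only the first few records of such an iterator. It uses a stand-in `Record` type and a hypothetical `fake_records` generator so that it runs without the dataset; `count_topics` is an illustrative helper, not part of corus.

```python
from collections import Counter, namedtuple
from itertools import islice

# Stand-in for LentaRecord so the sketch runs without corus or the dataset;
# the field names are copied from the output shown above.
Record = namedtuple('Record', ['url', 'title', 'text', 'topic', 'tags'])


def fake_records():
    # Hypothetical sample data, yielded lazily the way a loader would.
    yield Record('u1', 't1', 'first text', 'Россия', 'Общество')
    yield Record('u2', 't2', 'second text', 'Мир', 'Политика')
    yield Record('u3', 't3', 'third text', 'Россия', 'Общество')


def count_topics(records, limit=None):
    # islice stops after `limit` records and leaves the rest unread;
    # limit=None means "read everything".
    return Counter(record.topic for record in islice(records, limit))


records = fake_records()  # with corus installed: load_lenta(path)
topic_counts = count_topics(records, limit=2)
print(topic_counts.most_common())

# The iterator is now partly consumed: only the third record is left.
remaining = list(records)
print(len(remaining))
```

With the real dataset, passing `load_lenta(path)` instead of `fake_records()` should work the same way, assuming the records expose a `topic` attribute as in the example output.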

For links to other datasets and their loaders see the Reference section.

Documentation

Materials are in Russian:

Install

corus supports Python 3.5+ and PyPy 3.

$ pip install corus

Reference

Dataset	API `from corus import`	Tags	Texts	Uncompressed	Description
Lenta.ru
Lenta.ru v1.0	`load_lenta` `#`	`news`	739 351	1.66 Gb	`wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz`
Lenta.ru v1.1+	`load_lenta2` `#`	`news`	800 975	1.94 Gb	`wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2`
Lib.rus.ec	`load_librusec` `#`	`fiction`	301 871	144.92 Gb	Dump of lib.rus.ec prepared for RUSSE workshop `wget http://panchenko.me/data/russe/librusec_fb2.plain.gz`
Rossiya Segodnya	`load_ria_raw` `#` `load_ria` `#`	`news`	1 003 869	3.70 Gb	`wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz`
Mokoron Russian Twitter Corpus	`load_mokoron` `#`	`social` `sentiment`	17 633 417	1.86 Gb	Russian Twitter sentiment markup Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql
Wikipedia	`load_wiki` `#`		1 541 401	12.94 Gb	Russian Wiki dump `wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2`
GramEval2020	`load_gramru` `#`		162 372	30.04 Mb	`wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip` `unzip master.zip` `mv GramEval2020-master/dataTrain train` `mv GramEval2020-master/dataOpenTest dev` `rm -r master.zip GramEval2020-master` `wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu`
OpenCorpora	`load_corpora` `#`	`morph`	4 030	20.21 Mb	`wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip`
RusVectores SimLex-965	`load_simlex` `#`	`emb` `sim`			`wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv` `wget https://rusvectores.org/static/testsets/ru_simlex965.tsv`
Omnia Russica	`load_omnia` `#`	`morph` `web` `fiction`		489.62 Gb	Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf Manually download http://bit.ly/2ZT4BY9
factRuEval-2016	`load_factru` `#`	`ner` `news`	254	969.27 Kb	Manual PER, LOC, ORG markup prepared for 2016 Dialog competition `wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip` `unzip master.zip` `rm master.zip`
Gareev	`load_gareev` `#`	`ner` `news`	97	455.02 Kb	Manual PER, ORG markup (no LOC) Email Rinat Gareev (gareev-rm@yandex.ru) ask for dataset `tar -xvf rus-ner-news-corpus.iob.tar.gz` `rm rus-ner-news-corpus.iob.tar.gz`
Collection5	`load_ne5` `#`	`ner` `news`	1 000	2.96 Mb	News articles with manual PER, LOC, ORG markup `wget http://www.labinform.ru/pub/named_entities/collection5.zip` `unzip collection5.zip` `rm collection5.zip`
WiNER	`load_wikiner` `#`	`ner`	203 287	36.15 Mb	Sentences from Wiki auto annotated with PER, LOC, ORG tags `wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2`
BSNLP-2019	`load_bsnlp` `#`	`ner`	464	1.16 Mb	Markup prepared for 2019 BSNLP Shared Task `wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip` `wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip` `unzip TRAININGDATA_BSNLP_2019_shared_task.zip` `unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg` `rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip`
Persons-1000	`load_persons` `#`	`ner` `news`	1 000	2.96 Mb	Same as Collection5, only PER markup + normalized names `wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip`
The Russian Drug Reaction Corpus (RuDReC)	`load_rudrec` `#`	`ner`	4 809	1.73 Kb	RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part, to get the raw part (1.4M reviews) please refer to https://github.com/cimm-kzn/RuDReC. `wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json`
Taiga	Large collection of Russian texts from various sources: news sites, magazines, literature, social networks `wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz` `tar -xzvf retagged_taiga.tar.gz`
Arzamas	`load_taiga_arzamas` `#`	`news`	311	4.50 Mb
Fontanka	`load_taiga_fontanka` `#`	`news`	342 683	786.23 Mb
Interfax	`load_taiga_interfax` `#`	`news`	46 429	77.55 Mb
KP	`load_taiga_kp` `#`	`news`	45 503	61.79 Mb
Lenta	`load_taiga_lenta` `#`	`news`	36 446	95.15 Mb
N+1	`load_taiga_nplus1` `#`	`news`	7 696	24.96 Mb
Magazines	`load_taiga_magazines` `#`		39 890	2.19 Gb
Subtitles	`load_taiga_subtitles` `#`		19 011	909.08 Mb
Social	`load_taiga_social` `#`	`social`	1 876 442	648.18 Mb
Proza	`load_taiga_proza` `#`	`fiction`	1 732 434	38.25 Gb
Stihi	`load_taiga_stihi` `#`		9 157 686	12.80 Gb
Russian NLP Datasets	Several Russian news datasets from webhose.io, lenta.ru and other news sites.
News	`load_buriy_news` `#`	`news`	2 154 801	6.84 Gb	Dump of top 40 news + 20 fashion news sites. `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2` `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2` `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2`
Webhose	`load_buriy_webhose` `#`	`news`	285 965	859.32 Mb	Dump from webhose.io, 300 sources for one month. `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2`
ODS #proj_news_viz	Several news sites scraped by members of #proj_news_viz ODS project.
Interfax	`load_ods_interfax` `#`	`news`	543 961	1.22 Gb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz`
Gazeta	`load_ods_gazeta` `#`	`news`	865 847	1.63 Gb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz`
Izvestia	`load_ods_izvestia` `#`	`news`	86 601	307.19 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz`
Meduza	`load_ods_meduza` `#`	`news`	71 806	270.11 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz`
RIA	`load_ods_ria` `#`	`news`	101 543	233.88 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz`
Russia Today	`load_ods_rt` `#`	`news`	106 644	187.12 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz`
TASS	`load_ods_tass` `#`	`news`	1 135 635	3.27 Gb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz`
Universal Dependencies
GSD	`load_ud_gsd` `#`	`morph` `syntax`	5 030	1.01 Mb	`wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu`
Taiga	`load_ud_taiga` `#`	`morph` `syntax`	3 264	353.80 Kb	`wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu`
PUD	`load_ud_pud` `#`	`morph` `syntax`	1 000	207.78 Kb	`wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu`
SynTagRus	`load_ud_syntag` `#`	`morph` `syntax`	61 889	11.33 Mb	`wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu`
morphoRuEval-2017
General Internet-Corpus	`load_morphoru_gicrya` `#`	`morph`	83 148	10.58 Mb	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip` `unzip GIKRYA_texts_new.zip` `rm GIKRYA_texts_new.zip`
Russian National Corpus	`load_morphoru_rnc` `#`	`morph`	98 892	12.71 Mb	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar` `unrar x RNC_texts.rar` `rm RNC_texts.rar`
OpenCorpora	`load_morphoru_corpora` `#`	`morph`	38 510	4.80 Mb	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar` `unrar x OpenCorpora_Texts.rar` `rm OpenCorpora_Texts.rar`
RUSSE Russian Semantic Relatedness
HJ: Human Judgements of Word Pairs	`load_russe_hj` `#`	`emb` `sim`			`wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv`
RT: Synonyms and Hypernyms from the Thesaurus RuThes	`load_russe_rt` `#`	`emb` `sim`			`wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv`
AE: Cognitive Associations from the Sociation.org Experiment	`load_russe_ae` `#`	`emb` `sim`			`wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv` `wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv` `wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv`
Toloka Datasets
Lexical Relations from the Wisdom of the Crowd (LRWC)	`load_toloka_lrwc` `#`	`emb` `sim`			`wget https://tlk.s3.yandex.net/dataset/LRWC.zip` `unzip LRWC.zip` `rm LRWC.zip`
The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT)	`load_ruadrect` `#`	`social`	9 515	2.09 Mb	This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020 `wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip` `unzip RuADReCT.zip` `rm RuADReCT.zip`
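
Many of the downloads above are single compressed files (`.gz`, `.bz2`). Before choosing a loader it can help to peek at the first lines of a file without decompressing all of it. The following is a minimal sketch using only the standard library; `open_text` and `head` are hypothetical helper names, not part of corus, and the demo archive is made up so the code runs without any download. Archives such as `.zip`, `.rar` and `.tar.bz2` still need to be unpacked first, as in the commands in the table.

```python
import bz2
import gzip
import os
import tempfile
from itertools import islice


def open_text(path, encoding='utf8'):
    # Pick a decompressor from the file extension; fall back to plain text.
    if path.endswith('.gz'):
        return gzip.open(path, 'rt', encoding=encoding)
    if path.endswith('.bz2'):
        return bz2.open(path, 'rt', encoding=encoding)
    return open(path, encoding=encoding)


def head(path, count=3):
    # Read only the first `count` lines, so a large dump is not
    # decompressed in full.
    with open_text(path) as file:
        return [line.rstrip('\n') for line in islice(file, count)]


# Demo on a tiny made-up archive.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.csv.gz')
with gzip.open(demo_path, 'wt', encoding='utf8') as file:
    file.write('url,title\nhttps://example.com/1,Пример\n')

preview = head(demo_path)
print(preview)
```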

Support

Chat — https://t.me/natural_language_processing
Issues — https://github.com/natasha/corus/issues
Commercial support — https://lab.alexkuk.ru

Add new source

Implement corus/sources/<source>.py
Add import into corus/sources/__init__.py
Add meta into corus/sources/meta.py
Add example into docs.ipynb (check meta table is correct)
Run tests (readme is updated)
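
As a rough illustration of the first step, a source module typically boils down to a record type plus a generator function that yields records one at a time. The sketch below is hypothetical: `ExampleRecord`, `load_example` and the three-column gzip CSV layout with a header row are assumptions made for the example, and corus's own record classes and I/O helpers may look different. Check an existing module in corus/sources/ for the actual conventions.

```python
import csv
import gzip
import os
import tempfile
from collections import namedtuple

# Hypothetical record type; corus may define its records differently.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    # Generator: rows are parsed one at a time, so memory use stays flat
    # even for large files.
    with gzip.open(path, 'rt', encoding='utf8', newline='') as file:
        reader = csv.reader(file)
        next(reader)  # skip the header row (assumed to be present)
        for url, title, text in reader:
            yield ExampleRecord(url, title, text)


# Demo on a tiny made-up file, so the sketch runs without a real dataset.
demo_path = os.path.join(tempfile.mkdtemp(), 'example.csv.gz')
with gzip.open(demo_path, 'wt', encoding='utf8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['url', 'title', 'text'])
    writer.writerow(['https://example.com/1', 'Заголовок', 'Текст, с запятой'])

records = load_example(demo_path)
first = next(records)
print(first.title)
```

Returning a generator keeps the usage consistent with the `load_lenta` example in the Usage section, where records are read with `next(records)` or a `for` loop.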

Development

Dev env

python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate

pip install -r requirements/dev.txt
pip install -e .

python -m ipykernel install --user --name natasha-corus

Lint + update docs

make lint
make exec-docs

Release

# Update setup.py version

git commit -am 'Up version'
git tag v0.10.0

git push
git push --tags
