ReadME RoTex Corpus Builder

Builds a corpus of Romanian text, suitable for NLP research, from different online sources.

Primary language: Kotlin. License: GNU General Public License v3.0 (GPL-3.0).

Description

This project aims to make available, in an open and transparent way, a high-quality corpus of Romanian plain texts that can be used for NLP. The transformations applied to the original sources are as non-invasive as possible: for example, cedilla diacritics are corrected and lines are stitched back together for some PDF files that break paragraphs too much, but otherwise the text is kept in its original form. Some text is removed, for example the repetitive text from headers and footers of PDF files, page numbers, etc. Given the unstructured nature of PDF files, the results may vary.
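One example of such a correction is the cedilla fix: older texts often use s/t with cedilla (U+015E/U+015F, U+0162/U+0163) instead of the correct Romanian comma-below letters (ș, ț). A minimal sketch of that replacement (illustrative only; the actual implementation in this repository may differ):

```kotlin
// Map the legacy cedilla characters to the correct comma-below Romanian diacritics.
// Illustrative sketch only; the extractor in this repository may handle more cases.
val cedillaFixes = mapOf(
    '\u015F' to 'ș', '\u015E' to 'Ș', // ş / Ş -> ș / Ș
    '\u0163' to 'ț', '\u0162' to 'Ț'  // ţ / Ţ -> ț / Ț
)

fun fixCedillaDiacritics(text: String): String =
    text.map { cedillaFixes[it] ?: it }.joinToString("")

fun main() {
    println(fixCedillaDiacritics("aşa şi aţa")) // prints: așa și ața
}
```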

Warning: Please note that not all sources are in the public domain; securing usage rights might be necessary in some cases.

Sources

Sources ordered by total word count:

Source Word Count (¹) Types Count (²) DEX Coverage (³) Uncompressed size Compressed size Download
artapolitica 716.033 (691.224) 58.262 3,95% (48.418) 4 MB 1 MB ▼Download
biblior 1.181.820 (1.142.479) 89.642 5,9% (72.298) 7 MB 2 MB ▼Download
uzp 1.553.850 (1.415.284) 111.147 6,5% (79.622) 10 MB 4 MB ▼Download
carti-bune-gratis 1.619.354 (1.587.329) 76.103 5,54% (67.833) 9 MB 3 MB ▼Download
historica-cluj 2.542.106 (2.215.101) 150.298 6,55% (80.157) 18 MB 6 MB ▼Download
destine-literale 4.325.392 (3.686.367) 270.410 11,29% (138.233) 27 MB 11 MB ▼Download
certitudinea 4.338.169 (3.846.371) 117.207 6,95% (85.152) 28 MB 11 MB ▼Download
paul-goma 6.536.053 (6.111.977) 254.228 10,82% (132.469) 41 MB 16 MB ▼Download
rudolf-steiner 7.678.761 (6.721.026) 106.878 5,52% (67.549) 50 MB 15 MB ▼Download
litera-net 8.591.552 (8.211.512) 263.844 14,88% (182.250) 54 MB 21 MB ▼Download
napoca-news 12.376.780 (11.011.431) 297.076 13,07% (159.974) 83 MB 32 MB ▼Download
biblioteca-digitala-ase 16.105.049 (15.107.692) 256.196 10,73% (131.383) 121 MB 37 MB ▼Download
jrq-aquis 17.934.242 (15.007.550) 294.193 7,62% (93.247) 140 MB 44 MB ▼Download
biblioteca-pe-mobil 19.299.099 (17.385.248) 419.782 17,09% (209.309) 116 MB 44 MB ▼Download
ziarul-lumina 23.693.901 (20.548.062) 271.607 13,17% (161.249) 168 MB 59 MB ▼Download
gazeta-de-cluj 25.772.022 (24.185.518) 320.891 14,09% (172.503) 171 MB 59 MB ▼Download
bestseller-md 27.766.289 (26.687.128) 348.555 18,01% (220.517) 171 MB 63 MB ▼Download
archive-org 32.418.839 (30.728.463) 761.252 24,58% (300.945) 210 MB 77 MB ▼Download
dcep 34.534.679 (30.362.284) 174.371 6,75% (82.655) 262 MB 71 MB ▼Download
bzi 42.923.167 (40.427.447) 289.744 13,96% (170.975) 301 MB 105 MB ▼Download
dgt-aquis 61.058.089 (53.111.759) 466.234 11,04% (135.226) 467 MB 108 MB ▼Download
ru-101-books 87.936.969 (83.668.310) 706.772 24,83% (303.991) 534 MB 199 MB ▼Download
dezbateri-parlamentare 109.244.724 (106.563.919) 250.406 14,22% (174.140) 764 MB 227 MB ▼Download
jurisprudenta 114.208.968 (107.719.873) 285.542 11,02% (134.916) 798 MB 213 MB ▼Download
just 188.155.635 (178.843.784) 580.225 20,16% (246.794) 1.998 MB 349 MB ▼Download
wiki-ro 198.707.897 (161.989.666) 2.429.146 40,85% (500.213) 1.441 MB 341 MB ▼Download
all-readme-rotex 1.051.219.439 (958.976.804) 4.467.831 62,22% (761.775) 8.007 MB 2.132 MB ▼Download

(¹) Total number of words in the source, where a word is any sequence of letters, even if it is not present in DEX. The number in parentheses is the count of words that are also found in DEX as a word form.

(²) Total number of types (unique words) in the source. Theoretically this should be below the number of word forms in DEX; however, for sources that contain fragments in other languages (like Wikipedia) or gibberish tokens such as 'vgr', the number can be higher.

(³) The percentage of DEX word forms that are covered by the source. For example, DEX has approximately 1.2 million word forms, so if the source contains 130.000 unique DEX word forms the coverage is about 11%.
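As a rough illustration of how these three metrics relate, given a set of DEX word forms loaded from the database (the function below is hypothetical, not one of the classes actually used in this repository):

```kotlin
// Illustrative sketch of the three table metrics; names are hypothetical.
fun corpusStats(text: String, dexWordForms: Set<String>): Triple<Int, Int, Double> {
    // A "word" is any sequence of letters, lowercased for matching against DEX.
    val words = Regex("""\p{L}+""").findAll(text).map { it.value.lowercase() }.toList()
    val wordCount = words.size                          // Word Count
    val types = words.toHashSet()                       // unique words (types)
    val dexTypes = types.count { it in dexWordForms }   // unique words also found in DEX
    val coverage = 100.0 * dexTypes / dexWordForms.size // DEX Coverage (%)
    return Triple(wordCount, types.size, coverage)
}
```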

How it works

This tool automatically downloads the sources, extracts and cleans the text, and assembles the resulting text archives.

The build process has 3 main steps:

  • Download - The sources are downloaded and saved locally in the original folder. This folder keeps the sources as the original files (PDF, EPUB, etc.) or as close to the original as possible (for HTML sources). The download is not always optimized for speed, in order to treat the source download servers gently.
  • Extract - The text is extracted from the original files and saved as a text file in the text folder. All text corrections and transformations are done in this step. Examples of corrections are cedilla diacritics replacement, restoring PDF font mappings, stripping multiple blank lines, etc. For some sources, OCR is applied before text extraction even if a text layer already exists in the original document, in order to override the low-quality one available from the source, taking advantage of Tesseract 4 improvements.
  • Compress - The text is compressed as a .tar.gz file and saved in the text-compressed folder.

It is possible to run the pipeline incrementally; by default, the already completed steps are skipped. Running the complete build pipeline for all sources took more than 2 weeks on a 2.9 GHz i7 with 16 GB RAM, the main time consumers being the download (especially for sources with many small items) and the PDF OCR processes.
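Conceptually, the per-source flow looks roughly like the sketch below, where each step is skipped when its output already exists (the function signature and folder layout are illustrative; the real implementation differs in details):

```kotlin
import java.io.File

// Illustrative per-source pipeline skeleton following the steps described above.
fun buildSource(
    name: String,
    download: (File) -> Unit,      // fills the original/ folder
    extract: (File, File) -> Unit, // original files -> cleaned plain text
    compress: (File, File) -> Unit // plain text -> .tar.gz
) {
    val original = File("original/$name")
    val text = File("text/$name.txt")
    val compressed = File("text-compressed/$name.tar.gz")

    // Each step is skipped if its output already exists, so the pipeline
    // can be re-run incrementally after a partial build.
    if (!original.exists()) { original.mkdirs(); download(original) }
    if (!text.exists()) { extract(original, text) }
    if (!compressed.exists()) { compressed.parentFile.mkdirs(); compress(text, compressed) }
}
```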

The built corpus is available for download as .tar.gz files for each individual source and as a single big file containing the entire text.

Note: In a previous version the text was also run through a sentence builder which recovered sentences from broken text. This is not applied any more, in order to keep the text as close to the original as possible. Have a look at BufferedSentenceReader if you want to apply the sentence builder yourself; it works pretty well (see the unit tests), but it fails in some cases, as it does not use any named entity detection and the abbreviation detector is rule based.

Prerequisites

  • DEX Online - A MariaDB database with DEX needs to be available on localhost; see the instructions. The database is used to build a trie of word forms that is then cached to disk. On subsequent runs the trie is loaded directly, without going to the database.
  • ocrmypdf - Tool used to apply a text layer to PDF files. It uses Tesseract for the actual OCR. The simplest way to run it is to use the docker image ocrmypdf-polyglot as described here. It is executed as a separate process (a sketch follows this list).
  • yadisk-direct - Tool used to convert yadisk links to direct download links, for the romania-inedit-forum source. It is executed as a separate process.
  • djvutxt - Tool used to extract text from DjVu files. It is executed as a separate process.
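The external tools above are run as separate processes. A minimal sketch of such an invocation for ocrmypdf, assuming it is available on the PATH and using the Romanian Tesseract model (the exact options used by this project may differ):

```kotlin
import java.io.File

// Minimal sketch of running ocrmypdf as a separate process.
fun ocrPdf(input: File, output: File): Int {
    val process = ProcessBuilder(
        "ocrmypdf",
        "--force-ocr",   // redo OCR even if a (low quality) text layer already exists
        "-l", "ron",     // Romanian language model for Tesseract
        input.absolutePath,
        output.absolutePath
    )
        .inheritIO()     // forward the tool's output to our console
        .start()
    return process.waitFor() // 0 on success
}
```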

Contributions Welcomed

There are multiple ways to make this tool better; any help is highly appreciated:

  • Recommend a new source of high-quality Romanian text to be added, with lots of content (ideally >10 mil. words; diacritics are a must, good formatting, continuous text);
  • Pick a proposed source from below, implement it and make a PR;
  • Improve the text extraction of an existing source and make a PR;
  • Help obtain official usage rights from source owners.

TODO

Sources to be considered for inclusion:

Discarded Sources

Below are sources that were considered for inclusion but rejected for various reasons: