ReadME RoTex Corpus Builder

Builds a corpus of Romanian text, suitable for NLP research, from different online sources.

Primary language: Kotlin. License: GNU General Public License v3.0 (GPL-3.0).

Description

This project aims to make available, in an open and transparent way, a high-quality corpus of Romanian plain texts that can be used for NLP. The transformations applied to the original sources are as non-invasive as possible: for example, cedilla diacritics are corrected and lines are stitched back together for some PDF files that break paragraphs too much, but otherwise the text is kept in its original form. Some text is removed, for example the repetitive text from headers and footers of PDF files, page numbers, etc. Given the unstructured nature of PDF files, the results may vary.
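One example of such a correction is the cedilla fix: older texts often use s/t with cedilla (U+015E/U+015F, U+0162/U+0163) instead of the correct Romanian comma-below letters (ș, ț). A minimal sketch of that replacement (illustrative only; the actual implementation in this repository may differ):

```kotlin
// Map the legacy cedilla characters to the correct comma-below Romanian diacritics.
// Illustrative sketch only; the extractor in this repository may handle more cases.
val cedillaFixes = mapOf(
    '\u015F' to 'ș', '\u015E' to 'Ș', // ş / Ş -> ș / Ș
    '\u0163' to 'ț', '\u0162' to 'Ț'  // ţ / Ţ -> ț / Ț
)

fun fixCedillaDiacritics(text: String): String =
    text.map { cedillaFixes[it] ?: it }.joinToString("")

fun main() {
    println(fixCedillaDiacritics("aşa şi aţa")) // prints: așa și ața
}
```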

Warning: Please note that not all sources are in the public domain; securing usage rights might be necessary in some cases.

Sources

Sources ordered by total word count:

Source Word Count (¹) Types Count (²) DEX Coverage (³) Uncompressed size Compressed size Download
artapolitica 716.033 (691.224) 58.262 3,95% (48.418) 4 MB 1 MB ▼Download
biblior 1.181.820 (1.142.479) 89.642 5,9% (72.298) 7 MB 2 MB ▼Download
uzp 1.553.850 (1.415.284) 111.147 6,5% (79.622) 10 MB 4 MB ▼Download
carti-bune-gratis 1.619.354 (1.587.329) 76.103 5,54% (67.833) 9 MB 3 MB ▼Download
historica-cluj 2.542.106 (2.215.101) 150.298 6,55% (80.157) 18 MB 6 MB ▼Download
destine-literale 4.325.392 (3.686.367) 270.410 11,29% (138.233) 27 MB 11 MB ▼Download
certitudinea 4.338.169 (3.846.371) 117.207 6,95% (85.152) 28 MB 11 MB ▼Download
paul-goma 6.536.053 (6.111.977) 254.228 10,82% (132.469) 41 MB 16 MB ▼Download
rudolf-steiner 7.678.761 (6.721.026) 106.878 5,52% (67.549) 50 MB 15 MB ▼Download
litera-net 8.591.552 (8.211.512) 263.844 14,88% (182.250) 54 MB 21 MB ▼Download
napoca-news 12.376.780 (11.011.431) 297.076 13,07% (159.974) 83 MB 32 MB ▼Download
biblioteca-digitala-ase 16.105.049 (15.107.692) 256.196 10,73% (131.383) 121 MB 37 MB ▼Download
jrq-aquis 17.934.242 (15.007.550) 294.193 7,62% (93.247) 140 MB 44 MB ▼Download
biblioteca-pe-mobil 19.299.099 (17.385.248) 419.782 17,09% (209.309) 116 MB 44 MB ▼Download
ziarul-lumina 23.693.901 (20.548.062) 271.607 13,17% (161.249) 168 MB 59 MB ▼Download
gazeta-de-cluj 25.772.022 (24.185.518) 320.891 14,09% (172.503) 171 MB 59 MB ▼Download
bestseller-md 27.766.289 (26.687.128) 348.555 18,01% (220.517) 171 MB 63 MB ▼Download
archive-org 32.418.839 (30.728.463) 761.252 24,58% (300.945) 210 MB 77 MB ▼Download
dcep 34.534.679 (30.362.284) 174.371 6,75% (82.655) 262 MB 71 MB ▼Download
bzi 42.923.167 (40.427.447) 289.744 13,96% (170.975) 301 MB 105 MB ▼Download
dgt-aquis 61.058.089 (53.111.759) 466.234 11,04% (135.226) 467 MB 108 MB ▼Download
ru-101-books 87.936.969 (83.668.310) 706.772 24,83% (303.991) 534 MB 199 MB ▼Download
dezbateri-parlamentare 109.244.724 (106.563.919) 250.406 14,22% (174.140) 764 MB 227 MB ▼Download
jurisprudenta 114.208.968 (107.719.873) 285.542 11,02% (134.916) 798 MB 213 MB ▼Download
just 188.155.635 (178.843.784) 580.225 20,16% (246.794) 1.998 MB 349 MB ▼Download
wiki-ro 198.707.897 (161.989.666) 2.429.146 40,85% (500.213) 1.441 MB 341 MB ▼Download
all-readme-rotex 1.051.219.439 (958.976.804) 4.467.831 62,22% (761.775) 8.007 MB 2.132 MB ▼Download

(¹) Total number of words in the source, where a word is any sequence of letters, even if it is not present in DEX. The number in parentheses is the count of words that are also found in DEX as a word form.

(²) Total number of types (unique words) in the source. Theoretically this should be below the number of word forms in DEX; however, for sources that contain fragments in other languages (like Wikipedia) or gibberish tokens such as 'vgr', the number can be higher.

(³) The percentage of DEX word forms that are covered by the source. For example, DEX has approximately 1.2 million word forms, so if the source contains 130.000 unique DEX word forms the coverage is about 11%.
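As a rough illustration of how these three metrics relate, given a set of DEX word forms loaded from the database (the function below is hypothetical, not one of the classes actually used in this repository):

```kotlin
// Illustrative sketch of the three table metrics; names are hypothetical.
fun corpusStats(text: String, dexWordForms: Set<String>): Triple<Int, Int, Double> {
    // A "word" is any sequence of letters, lowercased for matching against DEX.
    val words = Regex("""\p{L}+""").findAll(text).map { it.value.lowercase() }.toList()
    val wordCount = words.size                          // Word Count
    val types = words.toHashSet()                       // unique words (types)
    val dexTypes = types.count { it in dexWordForms }   // unique words also found in DEX
    val coverage = 100.0 * dexTypes / dexWordForms.size // DEX Coverage (%)
    return Triple(wordCount, types.size, coverage)
}
```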

How it works

This tool automatically downloads the sources, extracts and cleans the text, and assembles the resulting text archives.

The build process has 3 main steps:

  • Download - The sources are downloaded and saved locally in the original folder. This folder keeps the sources as the original files (PDF, EPUB, etc.) or as close to the original as possible (for HTML sources). The download is not always optimized for speed, in order to treat the source download servers gently.
  • Extract - The text is extracted from the original files and saved as a text file in the text folder. All text corrections and transformations are done in this step. Examples of corrections are cedilla diacritics replacement, restoring PDF font mappings, stripping multiple blank lines, etc. For some sources, OCR is applied before text extraction even if a text layer already exists in the original document, in order to override the low-quality one available from the source, taking advantage of Tesseract 4 improvements.
  • Compress - The text is compressed as a .tar.gz file and saved in the text-compressed folder.

It is possible to run the pipeline incrementally; by default, the already completed steps are skipped. Running the complete build pipeline for all sources took more than 2 weeks on a 2.9 GHz i7 with 16 GB RAM, the main time consumers being the download (especially for sources with many small items) and the PDF OCR processes.
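Conceptually, the per-source flow looks roughly like the sketch below, where each step is skipped when its output already exists (the function signature and folder layout are illustrative; the real implementation differs in details):

```kotlin
import java.io.File

// Illustrative per-source pipeline skeleton following the steps described above.
fun buildSource(
    name: String,
    download: (File) -> Unit,      // fills the original/ folder
    extract: (File, File) -> Unit, // original files -> cleaned plain text
    compress: (File, File) -> Unit // plain text -> .tar.gz
) {
    val original = File("original/$name")
    val text = File("text/$name.txt")
    val compressed = File("text-compressed/$name.tar.gz")

    // Each step is skipped if its output already exists, so the pipeline
    // can be re-run incrementally after a partial build.
    if (!original.exists()) { original.mkdirs(); download(original) }
    if (!text.exists()) { extract(original, text) }
    if (!compressed.exists()) { compressed.parentFile.mkdirs(); compress(text, compressed) }
}
```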

The built corpus is available for download as .tar.gz files for each individual source and as a single big file containing the entire text.

Note: In a previous version the text was also run through a sentence builder which recovered sentences from broken text. This is not applied any more, in order to keep the text as close to the original as possible. Have a look at BufferedSentenceReader if you want to apply the sentence builder yourself; it works pretty well (see the unit tests), but it fails in some cases, as it does not use any named entity detection and the abbreviation detector is rule based.

Prerequisites

  • DEX Online - A MariaDB database with DEX needs to be available on localhost; see the instructions. The database is used to build a trie of word forms that is then cached to disk. On subsequent runs the trie is loaded directly, without going to the database.
  • ocrmypdf - Tool used to apply a text layer to PDF files. It uses Tesseract for the actual OCR. The simplest way to run it is to use the docker image ocrmypdf-polyglot as described here. It is executed as a separate process (a sketch follows this list).
  • yadisk-direct - Tool used to convert yadisk links to direct download links, for the romania-inedit-forum source. It is executed as a separate process.
  • djvutxt - Tool used to extract text from DjVu files. It is executed as a separate process.
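The external tools above are run as separate processes. A minimal sketch of such an invocation for ocrmypdf, assuming it is available on the PATH and using the Romanian Tesseract model (the exact options used by this project may differ):

```kotlin
import java.io.File

// Minimal sketch of running ocrmypdf as a separate process.
fun ocrPdf(input: File, output: File): Int {
    val process = ProcessBuilder(
        "ocrmypdf",
        "--force-ocr",   // redo OCR even if a (low quality) text layer already exists
        "-l", "ron",     // Romanian language model for Tesseract
        input.absolutePath,
        output.absolutePath
    )
        .inheritIO()     // forward the tool's output to our console
        .start()
    return process.waitFor() // 0 on success
}
```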

Contributions Welcomed

There are multiple ways to make this tool better; any help is highly appreciated:

  • Recommend a new source of high-quality Romanian text to be added, with lots of content (ideally >10 mil. words; diacritics are a must, good formatting, continuous text);
  • Pick a proposed source from below, implement it and make a PR;
  • Improve the text extraction of an existing source and make a PR;
  • Help obtain official usage rights from source owners.

TODO

Sources to be considered for inclusion:

Discarded Sources

Below are sources that were considered for inclusion but rejected for various reasons: