This project aims to make available, in an open and transparent way, a high-quality corpus of Romanian plain text that can be used for NLP. Transformations of the original sources are kept as non-invasive as possible: cedilla diacritics are corrected and, for some PDF files that break paragraphs too aggressively, lines are stitched back together, but otherwise the text is kept in its original form. Some text is removed, for example repetitive headers and footers in PDF files, page numbers, etc. Given the unstructured nature of PDF files, the results may vary.
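As an illustration of the cedilla correction mentioned above, here is a minimal Python sketch. It assumes the correction is a plain character-for-character replacement of the legacy cedilla letters (ş, ţ) with the comma-below letters (ș, ț); the function name `fix_cedilla` is hypothetical and the project's actual implementation may differ.

```python
# Legacy cedilla letters mapped to the comma-below letters of standard Romanian.
# Assumption: a one-to-one character replacement is all that is needed.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla (uppercase) -> S with comma below
    "\u015F": "\u0219",  # s with cedilla (lowercase) -> s with comma below
    "\u0162": "\u021A",  # T with cedilla (uppercase) -> T with comma below
    "\u0163": "\u021B",  # t with cedilla (lowercase) -> t with comma below
})


def fix_cedilla(text: str) -> str:
    """Return text with cedilla diacritics replaced by comma-below ones."""
    return text.translate(CEDILLA_TO_COMMA)
```

All other characters, including ă, â and î, pass through unchanged.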
Warning: Please note that not all sources are public domain and securing usage rights might be necessary in some cases.
Sources, ordered by total word count:
Source | Word Count (¹) | Types Count (²) | DEX Coverage (³) | Uncompressed size | Compressed size |
---|---|---|---|---|---|
artapolitica ➭ | 716.033 (691.224) | 58.262 | 3,95% (48.418) | 4 MB | 1 MB ▼Download |
biblior ➭ | 1.181.820 (1.142.479) | 89.642 | 5,9% (72.298) | 7 MB | 2 MB ▼Download |
uzp ➭ | 1.553.850 (1.415.284) | 111.147 | 6,5% (79.622) | 10 MB | 4 MB ▼Download |
carti-bune-gratis ➭ | 1.619.354 (1.587.329) | 76.103 | 5,54% (67.833) | 9 MB | 3 MB ▼Download |
historica-cluj ➭ | 2.542.106 (2.215.101) | 150.298 | 6,55% (80.157) | 18 MB | 6 MB ▼Download |
destine-literale ➭ | 4.325.392 (3.686.367) | 270.410 | 11,29% (138.233) | 27 MB | 11 MB ▼Download |
certitudinea ➭ | 4.338.169 (3.846.371) | 117.207 | 6,95% (85.152) | 28 MB | 11 MB ▼Download |
paul-goma ➭ | 6.536.053 (6.111.977) | 254.228 | 10,82% (132.469) | 41 MB | 16 MB ▼Download |
rudolf-steiner ➭ | 7.678.761 (6.721.026) | 106.878 | 5,52% (67.549) | 50 MB | 15 MB ▼Download |
litera-net ➭ | 8.591.552 (8.211.512) | 263.844 | 14,88% (182.250) | 54 MB | 21 MB ▼Download |
napoca-news ➭ | 12.376.780 (11.011.431) | 297.076 | 13,07% (159.974) | 83 MB | 32 MB ▼Download |
biblioteca-digitala-ase ➭ | 16.105.049 (15.107.692) | 256.196 | 10,73% (131.383) | 121 MB | 37 MB ▼Download |
jrq-aquis ➭ | 17.934.242 (15.007.550) | 294.193 | 7,62% (93.247) | 140 MB | 44 MB ▼Download |
biblioteca-pe-mobil ➭ | 19.299.099 (17.385.248) | 419.782 | 17,09% (209.309) | 116 MB | 44 MB ▼Download |
ziarul-lumina ➭ | 23.693.901 (20.548.062) | 271.607 | 13,17% (161.249) | 168 MB | 59 MB ▼Download |
gazeta-de-cluj ➭ | 25.772.022 (24.185.518) | 320.891 | 14,09% (172.503) | 171 MB | 59 MB ▼Download |
bestseller-md ➭ | 27.766.289 (26.687.128) | 348.555 | 18,01% (220.517) | 171 MB | 63 MB ▼Download |
archive-org ➭ | 32.418.839 (30.728.463) | 761.252 | 24,58% (300.945) | 210 MB | 77 MB ▼Download |
dcep ➭ | 34.534.679 (30.362.284) | 174.371 | 6,75% (82.655) | 262 MB | 71 MB ▼Download |
bzi ➭ | 42.923.167 (40.427.447) | 289.744 | 13,96% (170.975) | 301 MB | 105 MB ▼Download |
dgt-aquis ➭ | 61.058.089 (53.111.759) | 466.234 | 11,04% (135.226) | 467 MB | 108 MB ▼Download |
ru-101-books ➭ | 87.936.969 (83.668.310) | 706.772 | 24,83% (303.991) | 534 MB | 199 MB ▼Download |
dezbateri-parlamentare ➭ | 109.244.724 (106.563.919) | 250.406 | 14,22% (174.140) | 764 MB | 227 MB ▼Download |
jurisprudenta ➭ | 114.208.968 (107.719.873) | 285.542 | 11,02% (134.916) | 798 MB | 213 MB ▼Download |
just ➭ | 188.155.635 (178.843.784) | 580.225 | 20,16% (246.794) | 1.998 MB | 349 MB ▼Download |
wiki-ro ➭ | 198.707.897 (161.989.666) | 2.429.146 | 40,85% (500.213) | 1.441 MB | 341 MB ▼Download |
all-readme-rotex ➭ | 1.051.219.439 (958.976.804) | 4.467.831 | 62,22% (761.775) | 8.007 MB | 2.132 MB ▼Download |
(¹) Total number of words in the source, where a word is any sequence of letters, even if it is not present in DEX. The number in parentheses is the count of words that are also found in DEX as a word form.
(²) Total number of types (unique words) in the source. In theory this should be lower than the number of word forms in DEX. It can be higher when the source contains fragments in other languages, as in Wikipedia, or gibberish text such as 'vgr'.
(³) The percentage of the word forms in DEX that occur in the source, with the number of matching word forms in parentheses. For example, DEX has approximately 1.2 million word forms, so a source with 130.000 unique words that are also in DEX has a coverage of about 11%.
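The three statistics above can be sketched in Python as follows. This is only an illustration of the definitions in the notes, not the project's code: the function name `corpus_stats`, the lowercasing of words and the use of a plain `set` for the DEX word forms are all assumptions.

```python
import re

# A word is any run of letters; [^\W\d_] matches Unicode letters only.
WORD_RE = re.compile(r"[^\W\d_]+")


def corpus_stats(text, dex_forms):
    """Compute word count, type count and DEX coverage for a text.

    dex_forms is a set of lowercase word forms standing in for DEX.
    Lowercasing the text before matching is an assumption of this sketch.
    """
    words = [w.lower() for w in WORD_RE.findall(text)]
    types = set(words)
    types_in_dex = types & dex_forms
    return {
        "words": len(words),                                   # note 1
        "words_in_dex": sum(1 for w in words if w in dex_forms),
        "types": len(types),                                   # note 2
        "types_in_dex": len(types_in_dex),
        # note 3: share of all DEX word forms that occur in the text
        "coverage_pct": 100.0 * len(types_in_dex) / len(dex_forms),
    }
```

With the figures from note (³), 130.000 matching word forms out of about 1.200.000 gives 100 × 130.000 / 1.200.000 ≈ 10,8%, which the note rounds to 11%.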
This tool automatically downloads the sources, extracts and cleans the text, and assembles the resulting text archives.
The build process has 3 main steps:
- Download - The sources are downloaded and saved locally in the `original` folder. This folder keeps the sources as original files (PDF, epub, etc.) or as close as possible to the original (for HTML sources). The download is not always optimized for speed, in order to treat the source servers gently.
- Extract - The text is extracted from the original files and saved as a text file in the `text` folder. All text corrections and transformations are done in this step. Examples of corrections are cedilla diacritics replacement, restoring PDF font mappings, stripping multiple blank lines, etc. For some sources, OCR is applied before text extraction even if the original document already has a text layer, to override the low-quality one from the source and take advantage of Tesseract 4 improvements.
- Compress - The text is compressed as a .tar.gz file and saved in the `text-compressed` folder.
The pipeline can be run incrementally: by default, steps that have already been completed are skipped. Running the complete build pipeline for all sources took more than 2 weeks on a 2.9 GHz i7 with 16 GB RAM, the main time consumers being the download (especially for sources with many small items) and the PDF OCR processes.
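The skip-if-done behaviour can be sketched for the Compress step as below. This is a hedged illustration in Python, not the tool's own code: the function name `compress_if_missing` is hypothetical, and treating "the archive file already exists" as the marker for a completed step is an assumption.

```python
import tarfile
from pathlib import Path


def compress_if_missing(text_dir: Path, archive: Path) -> bool:
    """Pack text_dir into a .tar.gz archive unless the archive already exists.

    Returns True when the archive was created, False when the step was skipped.
    Assumption: an existing archive means the step already completed.
    """
    if archive.exists():
        return False
    archive.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive), "w:gz") as tar:
        # Store files under the source name rather than the full local path.
        tar.add(str(text_dir), arcname=text_dir.name)
    return True
```

A call such as `compress_if_missing(Path("text/wiki-ro"), Path("text-compressed/wiki-ro.tar.gz"))` would do the work on the first run and return immediately on later runs. A more robust version would write to a temporary file and rename it, so that an interrupted run does not leave a partial archive that looks complete.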
The built corpus is available for download as .tar.gz files for each individual source and as a single big file containing the entire text.
Note: In a previous version the text was also run through a sentence builder, which recovered sentences from broken text. This is no longer applied, to keep the text as close to the original as possible. Have a look at BufferedSentenceReader if you want to apply the sentence builder yourself. It works pretty well (see the unit tests) but fails in some cases, because it does not use any named entity detection and its abbreviation detector is rule based.
- DEX Online - A MariaDB database with DEX needs to be available on localhost, see the instructions. The database is used to build a trie of word forms that is then cached to disk. On subsequent runs the trie is loaded directly without going to the database.
- ocrmypdf - Tool used to apply a text layer to PDF files. It uses Tesseract for the actual OCR. The simplest way to run it is to use the Docker image ocrmypdf-polyglot as described here. It is executed as a separate process.
- yadisk-direct - Tool used to convert yadisk links to direct download links, for the romania-inedit-forum source. It is executed as a separate process.
- djvutxt - Tool used to extract text from DjVu files. It is executed as a separate process.
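For the DEX Online dependency above, the build-once-then-cache behaviour of the word-form trie could look roughly like the Python sketch below. Everything here is illustrative: the nested-dictionary trie, the pickle cache format and the names `build_trie`, `contains` and `load_or_build` are assumptions, and the `fetch_word_forms` callable stands in for the database query, which is not shown.

```python
import pickle
from pathlib import Path

END = ""  # marker key for "a word form ends here"; cannot clash with a character


def build_trie(word_forms):
    """Build a trie as nested dictionaries keyed by character."""
    root = {}
    for form in word_forms:
        node = root
        for ch in form:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def contains(trie, word):
    """Return True if word was inserted as a complete word form."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


def load_or_build(cache_path: Path, fetch_word_forms):
    """Load the trie from cache_path, or build it and write the cache.

    fetch_word_forms is a hypothetical callable returning an iterable of
    word forms; it is only called when no cache file exists.
    """
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    trie = build_trie(fetch_word_forms())
    with cache_path.open("wb") as f:
        pickle.dump(trie, f)
    return trie
```

On the first run `load_or_build` calls `fetch_word_forms` and writes the cache file. Later runs read the file and never touch the database.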
There are multiple ways to make this tool better, and any help is highly appreciated:
- Recommend a new source to be added, with lots (ideally more than 10 million words) of high-quality text in Romanian (diacritics are a must, plus good formatting and continuous text);
- Pick a proposed source from below, implement it and make a PR;
- Improve the text extraction of an existing source and make a PR;
- Help obtain official usage rights from source owners.
Sources to be considered for inclusion:
Below are sources that were considered for inclusion but rejected for various reasons:
- https://biblioteca.regielive.ro/ (very low quality and structure)
- http://bjconstanta.ro/resurse-digitale/carte-romaneasca/ (low quality, not that much text)
- http://www.elefant.ro/list/ebooks/fictiune/literatura-romana/literatura-romana-clasica?filterprice=0+-+0 (already in bestseller-md)
- http://www.bibnat.ro/Biblioteca-Digitala-Nationala-s135-ro.htm (too old)
- http://www.respiro.org/ebook.html (only a few in Romanian, fragmented)
- http://www.cimec.ro/Biblioteca-Digitala/Biblioteca.html (too many images, very fragmented text)
- https://rmj.com.ro/rmj-vol-lvi-nr-3-an-2009/ (protected for most recent years, fragmented document types, not that much text)
- http://www.tion.ro/date/2019 (under 1 mil words - implemented)
- https://bunavestire.ca/revista-candela/ (too small)
- http://colegiulasachi.uv.ro/scolara.html (too small)
- http://www.romlit.ro/index.pl/arhiva_2018_ro (not available anymore)
- http://www.banaterra.eu/biblioteca/ (not available anymore)
- http://www.umft.ro/carti-in-format-electronic--medicina-generala_184 (too small)
- http://www.dacoromania.inst-puscariu.ro/ (too small)
- https://uituculblog.wordpress.com/citeste-online/carti-pdf/ (too small)
- http://www.ceeol.com/ (not available for download)
- https://www.balcanii.ro/2018/11/ (too small)
- https://radiojurnalspiritual.ro/carti-alese/ (too small)