This repository contains scrapers and lists of links for various online sources of data in Leichte Sprache.
To use our preprocessing pipeline, follow these steps:
1. Create a new folder named after your dataset.
2. Add subfolders for monolingual and/or aligned data:
   - For monolingual data: create a subfolder 'monolingual' inside the dataset folder.
   - For aligned data: create a subfolder 'aligned' inside the dataset folder.
3. Add the text files:
   - For monolingual data: compress your raw .txt files into a zip file called 'corpus.zip' and place it in your 'monolingual' folder. Make sure the filenames inside corpus.zip meet the following criteria:
     - no spaces
     - no German umlauts
     - no other special characters such as brackets
   - For aligned data: place your csv file (arbitrary name) in your 'aligned' folder. Make sure the csv has one column normal_phrase and one column simple_phrase.
4. If a clean_data folder already exists, you can delete it.
5. Run `python3 preprocess.py` to generate the preprocessed data.
6. You can now find your cleaned data in the folder clean_data.
If you add new data later, repeat steps 4 to 6 to regenerate the cleaned data.
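The file preparation described in the steps above can be scripted. The following is a minimal sketch and not part of this repository: the helper names `safe_name`, `build_corpus_zip`, and `has_required_columns` are hypothetical, the umlaut replacements (ä becomes ae, and so on) are only one possible convention, and the csv is assumed to be comma-separated and UTF-8 encoded.

```python
import csv
import re
import unicodedata
import zipfile
from pathlib import Path

# Illustrative replacement table for German umlauts and sharp s (an assumption).
UMLAUTS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
})


def safe_name(filename: str) -> str:
    """Return a filename without spaces, umlauts, or other special characters."""
    # Compose characters first so that decomposed umlauts are matched as well.
    name = unicodedata.normalize("NFC", filename).translate(UMLAUTS)
    # Strip remaining accents, then replace anything outside [A-Za-z0-9._-].
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^A-Za-z0-9._-]+", "_", name)


def build_corpus_zip(raw_dir: Path, dataset_dir: Path) -> Path:
    """Zip all .txt files from raw_dir into <dataset_dir>/monolingual/corpus.zip."""
    target = dataset_dir / "monolingual" / "corpus.zip"
    target.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(raw_dir.glob("*.txt")):
            # Store each file under its sanitized name.
            archive.write(path, arcname=safe_name(path.name))
    return target


def has_required_columns(csv_path: Path) -> bool:
    """Check that an aligned csv has the normal_phrase and simple_phrase columns."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return {"normal_phrase", "simple_phrase"} <= set(header)
```

For example, `build_corpus_zip(Path("raw_txt"), Path("my_dataset"))` would write `my_dataset/monolingual/corpus.zip`, where `raw_txt` and `my_dataset` are placeholder names. Two different source names can map to the same sanitized name, so it is worth checking the archive listing afterwards.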
- If characters are corrupted, switch between text.decode() (UTF-8 by default) and text.decode("latin-1").
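Instead of switching by hand, the two decodings can be tried in order. This is a minimal sketch under the assumption that the raw text is available as `bytes`; the helper name `decode_text` is hypothetical and does not appear in the preprocessing script.

```python
def decode_text(raw: bytes) -> str:
    """Decode bytes as UTF-8 and fall back to Latin-1 if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises,
        # but it can still yield wrong characters for other encodings.
        return raw.decode("latin-1")
```

Trying UTF-8 first matters: because Latin-1 accepts any byte sequence, using it first would silently garble valid UTF-8 text.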
dictionary-meta.csv contains a list of URLs to the dictionary entries we used for compiling a manually sentence-aligned dataset. The election program and MDR news corpus can be provided upon request.
source | # articles | # sentences | quality | type |
---|---|---|---|---|
hurraki | 3 911 | 56 785 | good | lexicon |
nachrichtenleicht | 7 709 | 122 842 | good | news |
klexikon | 2 350 | 80 042 | medium | lexicon |
ndr | 1 817 | 60 749 | good | news |
einfachstars | 6 488 | 129 674 | good | news |
hdasprachtechnologie | 44 | 4 210 | good | misc. |
lebenshilfe | 396 | 7 144 | good | misc. |
kurier | 4 519 | 67 827 | good | news |
total | 27 234 | 529 654 |
source | # articles | # sentences | quality | type |
---|---|---|---|---|
original_kurier | 3 476 | 77 587 | good | news |
total | 3 476 | 77 587 |
source | # sentences | creation method | leichte sprache | type |
---|---|---|---|---|
wiki_translation | 488 001 | scraper | no | lexicon |
apa_a2 | 9 456 | lha scraper | yes | news |
apa_b1 | 10 268 | lha scraper | yes | news |
kurier | 40 772 | vecalign scraper | yes | news |
total | 548 497 |
source | # sentences | creation method | leichte sprache | type |
---|---|---|---|---|
mdr news | 50 | manually aligned | yes | news |
mdr dictionary | 50 | manually aligned | yes | lexicon |
brandeins(*) | 106 | scraped (color coding) | yes | various |
wiki_auto_test | 147 | manually reviewed | no | lexicon |
wiki_auto_dev | 59 | manually reviewed | no | lexicon |
total | 412 |
source | # sentences | creation method | leichte sprache | type |
---|---|---|---|---|
wahlprogramm | 107 | manually aligned | yes | election program |
mdr news | 50 | manually aligned | yes | news |
mdr dictionary | 50 | manually aligned | yes | lexicon |
brandeins(*) | 106 | scraped (color coding) | yes | various |
total | 313 |
(*) Created with the help of Daniel Berger (da.berger@tum.de)
manually aligned: source and target phrases/sentences are aligned by hand by a human
manually reviewed: the alignment already exists (in this case, English alignments were translated into German) and a human corrects grammar and other mistakes
scraped (color coding): source and target phrases are displayed in different colors and can therefore be aligned automatically
- Geasy corpus (second sheet in file) with a collection of German Easy Language data sources
- Hurraki, a kind of "Leichte Sprache" Wikipedia (approx. 40 000 phrases on different topics) [html]
- Das Parlament (approx. 200 articles on different legal and sociological topics) [pdf]
- Bundestag Website (15 articles describing the work of the German parliament) [html]
- Federal Ministry of Labour and Social Affairs (17 articles on different legal / social topics; partly with corresponding original articles in standard German) [pdf]
- Bible texts (approx. 180 religious texts) [html]
- Federal Ministry of Justice (some word explanations concerning legal terms) [html]
- Einfach Teilhaben (Federal Ministry of Labour and Social Affairs) (> 30 articles of different social topics) [html]
- Deutsches Institut für Menschenrechte (approx. 30 articles of different legal / social topics; partly with corresponding original text in standard German) [pdf]
- News in "Leichte Sprache" (Deutschlandfunk) (lexicon + about 4 new articles per week) [html]
- News in "Leichte Sprache" (MDR) (lexicon + about 300 articles in total from the last 3 months, covering the states of Thuringia, Saxony, and Saxony-Anhalt) [html]
- nachrichtenleicht
- KLexikon - article-aligned, however not in Leichte Sprache
- hdaSprachtechnologie - article-aligned:
Topic | # chars (leichte) | # chars (standard) | # sentences (leichte) | # sentences (standard) | Notes |
---|---|---|---|---|---|
Election Programs | 40 406 | 782 843 | 773 | 6 376 | - |
Bible | 32 663 | 28 044 | 806 | 338 | - |
Tales | 104 647 | 92 335 | 2 332 | 587 | - |
BRK | 9 263 | 26 534 | 152 | 170 | - |
News | 6 197 | 6 305 | 102 | 81 | Standard German translation is sometimes mixed with "Leichte Sprache" |
Books | 1 863 | 2 666 | 45 | 29 | - |
- BRK (Convention on the Rights of Persons with Disabilities): 1 document in different languages, including standard German and "Leichte Sprache" [pdf]
- Collected parallel data sources in Geasy corpus: online
- GWW
- Heilpädagogische Hilfe Osnabrück
- Lebenshilfe Main-Taunus
- OWB
- einfach teilhaben
- capito
Much of the data comes from government websites or texts in a legal context, which skews the corpus toward these topics and writing styles.
Possible solutions:
Use sentence embedding methods to cluster the phrases from the data.
You may be able to recognize clusters that can be assigned to certain topics or writing styles.
Phrases from these clusters can then be weighted differently.
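The idea above can be sketched as follows. This is only an illustration under several assumptions: the embeddings would come from a sentence encoder of your choice (none is prescribed here), the toy 2-D vectors below merely stand in for real embeddings, and both plain k-means and inverse-cluster-size weighting are example choices rather than a method used in this repository. The function names `kmeans` and `cluster_weights` are hypothetical.

```python
import math
import random
from collections import Counter


def kmeans(vectors, k, iterations=20, seed=0):
    """Cluster equal-length vectors with plain k-means; returns one label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members; keep it if it has none.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def cluster_weights(labels):
    """Weight each phrase inversely to its cluster size (weights average to 1)."""
    sizes = Counter(labels)
    total = len(labels)
    return [total / (len(sizes) * sizes[label]) for label in labels]


# Toy stand-ins for sentence embeddings: three similar phrases and one outlier.
vectors = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [5.0, 5.1]]
labels = kmeans(vectors, k=2)
weights = cluster_weights(labels)
```

With this weighting, phrases from a large cluster (for example, legal texts) each count less than phrases from a small cluster, so the lone outlier in the toy data receives the largest weight. Other schemes, such as manually chosen per-cluster weights after inspecting the clusters, fit the same structure.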