README for German Distant Speech Data Corpus 2014 / 2015
########################################################

Language Technology and Telecooperation labs, TU-Darmstadt, Germany
https://www.lt.tu-darmstadt.de
https://www.tk.informatik.tu-darmstadt.de

* This archive contains CC-BY licensed speech corpus data presented in:

Stephan Radeck-Arneth, Benjamin Milde, Arvid Lange, Evandro Gouvea, Stefan Radomski, Max Mühlhäuser, Chris Biemann, "Open Source German Distant Speech Recognition: Corpus and Acoustic Model", Proceedings of Text, Speech and Dialogue (TSD), 2015

Please cite this paper if you use our data for research purposes. 
See LICENSE for details on the license.

* General information

- The speech data was collected in a controlled environment (same room, same microphone distances, etc.)
- The distance between the speakers and the microphones is about 1 meter
- Each recording has an XML transcription file that also includes speaker metadata
- The recordings include several concurrent audio streams from different microphones
- The data is curated (manually checked and corrected) to reduce errors and artefacts
- The speech data is divided into three independent data sets: training, dev and test. The dev and test sets contain new sentences and new speakers that are not part of the training set, in order to assess model quality in a speaker-independent, open-vocabulary setting.

* Information about the data collection procedure: 

Training set (recordings in 2014):
The sentences come from two main data sources: the German Wikipedia (spring 2014) and the Europarl corpus. Sentences were randomly chosen from both sources to be read by the speakers. The Europarl corpus (release v7) is a collection of the proceedings of the European Parliament between 1996 and 2011, compiled by Philipp Koehn (Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, http://www.statmt.org/europarl/). As a third data source, German command and control sentences, typical for a command and control setting in living rooms, were specified manually.
 
For the test/dev sets (recordings in 2015):
Additional sentences from the German Wikipedia and the Europarl corpus were selected for the recordings. We also collected German sentences from the web by crawling the German top-level domain and applying language filtering and deduplication. From the crawled data, only sentences starting with quotation marks were kept and then randomly sampled. The three text sources are represented with approximately equal numbers of recordings in the test/dev sets.

* Structure of file names:

The XML meta file for each utterance shares the same prefix as the corresponding WAV files.

E.g. utterance 2014-03-27-13-32-38 consists of the following files:

2014-03-27-13-32-38.xml
2014-03-27-13-32-38_Kinect-Beam.wav (Kinect 1, beamformed audio signal through the Kinect SDK)
2014-03-27-13-32-38_Kinect-RAW.wav (Kinect 1, accessed directly as a normal microphone)
2014-03-27-13-32-38_Realtek.wav (internal Realtek microphone of an Asus PC, near a noisy fan)
2014-03-27-13-32-38_Samson.wav (Samson C01U)
2014-03-27-13-32-38_Yamaha.wav (Yamaha PSG-01S)
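Given this naming scheme, the files belonging to one utterance can be gathered with a short helper. This is a sketch, not part of the corpus tooling; the `utterance_files` function and the hard-coded channel list are our own, derived from the suffixes above:

```python
from pathlib import Path

# Microphone channel suffixes as listed above
CHANNELS = ["Kinect-Beam", "Kinect-RAW", "Realtek", "Samson", "Yamaha"]

def utterance_files(corpus_dir, utt_id):
    """Return the XML meta file and all existing channel WAVs for one utterance."""
    corpus_dir = Path(corpus_dir)
    xml = corpus_dir / f"{utt_id}.xml"
    wavs = {ch: corpus_dir / f"{utt_id}_{ch}.wav" for ch in CHANNELS}
    # Keep only channels whose WAV file actually exists on disk
    return xml, {ch: p for ch, p in wavs.items() if p.exists()}
```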

The XML meta file includes the sentence in its original text representation, taken from the various text corpora, and a cleaned version, in which the sentence is normalised to resemble as closely as possible what the speakers actually said.

The following is an example of the XML structure of the meta files. Besides the transcriptions, it also includes metadata about the speaker.

file: train/2014-08-13-14-33-22.xml

<?xml version="1.0" encoding="utf-8"?>
<recording>
        <speaker_id>c449bd24-7f29-44cf-920e-9c943d50e32a</speaker_id>
        <rate>16000</rate>
        <angle>-10,0267614147894</angle>
        <gender>female</gender>
        <ageclass>21-30</ageclass>
        <sentence_id>50</sentence_id>
        <sentence>Auf Grund der unterschiedlichen Färbung war bis ca. 1900 auch die Bezeichnung Goldadler für ausgewachsene Steinadler gebräuchlich.</sentence>
        <cleaned_sentence>Auf Grund der unterschiedlichen Färbung war bis circa neunzehnhundert auch die Bezeichnung Goldadler für ausgewachsene Steinadler gebräuchlich</cleaned_sentence>
        <corpus>WIKI</corpus>
        <muttersprachler>Ja</muttersprachler>
        <bundesland>Hessen</bundesland>
        <sourceurls><url>https://de.wikipedia.org/wiki/Steinadler</url></sourceurls>
</recording>

<speaker_id> contains a unique and anonymized ID for the speaker who read the sentence. Some metadata such as gender, age class, whether the speaker is a German native speaker (Muttersprachler) and federal state (Bundesland) are also available. Most speakers are from Hesse (Hessen) and between 21 and 30 years old.

We kept the raw sentence in <sentence> and included the normalised version in <cleaned_sentence>, in which, most notably, numbers and abbreviations are expanded to their full written forms and all punctuation is discarded. The normalised form should be used for training acoustic models.

There are four possible text sources; <corpus> states from which corpus the utterance was selected: "WIKI" for the German Wikipedia, "PARL" for the European Parliament Proceedings Parallel Corpus (see http://www.statmt.org/europarl/), "Commands" for short commands typical of a command and control setting, and "CITE" for crawled citations of direct speech. <sourceurls>, if available, contains one or more URLs pointing to the source document(s) of the utterance.
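The meta files can be read with Python's standard xml.etree.ElementTree module. A minimal sketch (the `read_meta` helper is our own, not part of the bundled scripts):

```python
import xml.etree.ElementTree as ET

def read_meta(xml_text):
    """Parse one recording's meta file into a flat dict of tag -> text."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    meta = {child.tag: (child.text or "").strip() for child in root}
    # <sourceurls> holds nested <url> elements rather than plain text
    meta["sourceurls"] = [u.text for u in root.findall("sourceurls/url")]
    return meta
```

With the example file above, `read_meta` yields e.g. `meta["gender"] == "female"` and `meta["corpus"] == "WIKI"`.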

* Structure of folders in package:

Train: includes the recordings for the training set. Sentences in this folder were recorded during 2014. A sentence may have been recorded several times, by the same speaker or by different speakers.

Test / Dev: includes the recordings for the test and dev sets. These sentences occur only once: there is no overlap with sentences in Train, and the Test / Dev recordings were made with a different set of speakers. Each sentence in Test / Dev is unique, i.e. recorded just once by one speaker.

Number of recorded sentences:
- Train: 14717
- Dev: 1085
- Test: 1028
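Since each utterance has exactly one XML meta file, these counts can be reproduced by counting XML files per split folder. A sketch, assuming the package was unpacked with flat train/dev/test folders:

```python
from pathlib import Path

def count_utterances(split_dir):
    """Each utterance has exactly one XML meta file, so counting *.xml
    files gives the number of recorded sentences in a split folder."""
    return sum(1 for _ in Path(split_dir).glob("*.xml"))

# Hypothetical usage, assuming the unpacked corpus in the current directory:
# for split in ("train", "dev", "test"):
#     print(split, count_utterances(split))
```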

* Training acoustic models

Kaldi training scripts for this corpus can be found at https://github.com/tudarmstadt-lt/kaldi-tuda-de . You can also report transcription and normalisation errors in this corpus via the issue tracker of that project.

* Editing normalized and cleaned sentences

Python scripts (Python 2.7) are included to export the cleaned sentences from all XML files into a single text file and to import them back into the XML transcription files. This makes it easier to edit the cleaned sentences in bulk.

See:

python export_cleaned_sentences.py --help
python import_cleaned_sentences.py --help
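The bundled scripts are the authoritative tools; purely to illustrate the export step, here is a minimal sketch. The `export_cleaned` helper and its tab-separated "utterance-id TAB cleaned sentence" output format are assumptions for illustration, not the scripts' actual format:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def export_cleaned(corpus_dir, out_path):
    """Write one 'utterance-id<TAB>cleaned sentence' line per XML meta file
    (assumed illustrative format, not that of the bundled scripts)."""
    with open(out_path, "w", encoding="utf-8") as out:
        for xml_file in sorted(Path(corpus_dir).glob("*.xml")):
            tree = ET.parse(xml_file)
            sent = tree.findtext("cleaned_sentence", default="")
            out.write(f"{xml_file.stem}\t{sent}\n")
```

The import direction would reverse this: read the edited text file and write each sentence back into the matching file's <cleaned_sentence> element.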