AASHISHAG/deepspeech-german

SWC + M-AILABS Corpora

fabianbusch opened this issue · 6 comments

Have you already tried to improve the german model with these both datasets?
I have seen that it was an option in pre-processing ... I am very curious about whether this could improve the model, but I do not have the computing power or the in-depth model-development knowledge. Here is a Kaldi-based example where this dataset combination worked very well: https://github.com/uhh-lt/kaldi-tuda-de

Looking forward to hearing from you!

All the best
Fabian

zesch commented

Thanks. It is on the todo list.

@fabianbusch : We have released v0.6.0. You can find the link in the ReadMe. SWC is still in the pipeline.

Thank you very much for keeping me up to date :)

These two datasets can be downloaded with audiomate, just like the Voxforge dataset; see here for a list of available downloaders.
By the way, the TUDA and Common Voice datasets can also be downloaded with audiomate, so there is no need to download any of the datasets manually.
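
audiomate itself is published on PyPI, so a plain pip install audiomate should be enough to get the downloaders used below (pin a specific version if you need reproducibility).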

An example snippet that downloads the datasets used:

import os

from audiomate.corpus.io import SWCDownloader
from audiomate.corpus.io import MailabsDownloader
from audiomate.corpus.io import CommonVoiceDownloader
from audiomate.corpus.io import TudaDownloader
from audiomate.corpus.io import VoxforgeDownloader

download_dir = '/path/to/download/dir'

# set up one downloader per corpus (German data)
dl_swc = SWCDownloader(lang='de')
dl_mailabs = MailabsDownloader(tags='de_DE')
dl_common_voice = CommonVoiceDownloader(lang='de')
dl_tuda = TudaDownloader()
dl_voxforge = VoxforgeDownloader(lang='de')

# download each corpus into its own subdirectory
dl_swc.download(os.path.join(download_dir, 'swc'))
dl_mailabs.download(os.path.join(download_dir, 'mailabs'))
dl_common_voice.download(os.path.join(download_dir, 'common-voice'))
dl_tuda.download(os.path.join(download_dir, 'tuda'))
dl_voxforge.download(os.path.join(download_dir, 'voxforge'))
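
Before running the full merge below, it can be worth sanity-checking that a downloaded corpus loads at all. A small sketch, reusing only the Corpus.load readers shown in the next example (the 'voxforge' reader name is taken from there):

import os

from audiomate.corpus import Corpus

download_dir = '/path/to/download/dir'

# load a single downloaded corpus and check that utterances were found
voxforge = Corpus.load(os.path.join(download_dir, 'voxforge'), reader='voxforge')
print(len(voxforge.utterances))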

Another example that loads, merges, cleans (using the text_cleaning module of this project), splits, and writes the datasets:

from audiomate.corpus import Corpus
from audiomate.corpus.subset import Splitter
from audiomate.corpus import io
from audiomate.corpus import LL_WORD_TRANSCRIPT
import os
import text_cleaning

corpora = list()
download_dir = '/path/to/download/dir'

def clean_transcriptions(corpus):
    for utterance in corpus.utterances.values():
        ll = utterance.label_lists[LL_WORD_TRANSCRIPT]

        for label in ll:
            label.value = text_cleaning.clean_sentence(label.value)

# loading corpora
corpora.append(Corpus.load(os.path.join(download_dir, 'swc'), reader='swc'))
corpora.append(Corpus.load(os.path.join(download_dir, 'mailabs'), reader='mailabs'))
corpora.append(Corpus.load(os.path.join(download_dir, 'common-voice'), reader='common-voice'))
corpora.append(Corpus.load(os.path.join(download_dir, 'tuda'), reader='tuda'))
corpora.append(Corpus.load(os.path.join(download_dir, 'voxforge'), reader='voxforge'))

# merging and cleaning corpora
merged_corpora = Corpus.merge_corpora(corpora)
clean_transcriptions(merged_corpora)

# splitting merged corpora
splitter = Splitter(merged_corpora, random_seed=42)
splitted_corpora = splitter.split(proportions={
        'train': 0.7,
        'dev': 0.15,
        'test': 0.15
    }, separate_issuers=True)

# import the split subsets into the merged corpus
merged_corpora.import_subview('train', splitted_corpora['train'])
merged_corpora.import_subview('dev', splitted_corpora['dev'])
merged_corpora.import_subview('test', splitted_corpora['test'])

# write merged corpora in DeepSpeech format
deepspeech_writer = io.MozillaDeepSpeechWriter()
deepspeech_writer.save(merged_corpora, '/path/to/write/corpora')
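
The clean_sentence function comes from this repository's text_cleaning module. If you just want to reproduce the pipeline without the repository checked out, a minimal stand-in could look roughly like the sketch below; this is only an assumption about what such a cleaner does (lowercasing, stripping punctuation, collapsing whitespace), the actual implementation in this project is more thorough, e.g. it also normalizes numbers.

import re

def clean_sentence(sentence):
    # hypothetical minimal cleaner: lowercase, keep only German letters
    # and spaces, collapse whitespace
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-zäöüß ]", " ", sentence)
    sentence = re.sub(r"\s+", " ", sentence).strip()
    return sentence

print(clean_sentence('Hallo, Welt!'))  # -> 'hallo welt'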

Hopefully this helps.

@fabianbusch : The SWC and M-AILABS corpora have been added to the new release model v0.9.0.

Closing the ticket.

Thanks 👍🤓