SWC + M-AILABS Corpora
fabianbusch opened this issue · 6 comments
Have you already tried to improve the German model with both of these datasets?
I have seen that it was an option in pre-processing ... I am very curious whether this could improve the model, but I have neither the computing power nor the in-depth model-development knowledge. Here is a Kaldi-based example where this dataset combination worked very well: https://github.com/uhh-lt/kaldi-tuda-de
Looking forward to hearing from you!
All the best
Fabian
Thanks. It is on the todo list.
@fabianbusch : We have released v0.6.0. You can find the link in the README. SWC is still in the pipeline.
Thank you very much for keeping me up to date :)
These two datasets can be downloaded with audiomate, like the VoxForge dataset. See here for a list of available Downloaders.
By the way, the TUDA and Common Voice datasets can also be downloaded with audiomate, so there is no need to download the datasets manually.
An example snippet that downloads the used datasets:
import os
from audiomate.corpus.io import SWCDownloader
from audiomate.corpus.io import MailabsDownloader
from audiomate.corpus.io import CommonVoiceDownloader
from audiomate.corpus.io import TudaDownloader
from audiomate.corpus.io import VoxforgeDownloader
download_dir = '/path/to/download/dir'
dl_swc = SWCDownloader(lang='de')
dl_mailabs = MailabsDownloader(tags='de_DE')
dl_common_voice = CommonVoiceDownloader(lang='de')
dl_tuda = TudaDownloader()
dl_voxforge = VoxforgeDownloader(lang='de')
dl_swc.download(os.path.join(download_dir, 'swc'))
dl_mailabs.download(os.path.join(download_dir, 'mailabs'))
dl_common_voice.download(os.path.join(download_dir, 'common-voice'))
dl_tuda.download(os.path.join(download_dir, 'tuda'))
dl_voxforge.download(os.path.join(download_dir, 'voxforge'))
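These downloads are large, so it can be handy to make the script re-runnable. A small sketch (the helper name is my own, and it only assumes the downloader objects expose a `download(path)` method as in the snippet above) that skips corpora whose target directory already exists:

```python
import os

def download_if_missing(downloader, target_dir):
    """Run the given downloader only if target_dir does not exist yet,
    so re-running the script skips downloads that already finished.
    Sketch: assumes downloader has a download(path) method as above."""
    if os.path.isdir(target_dir):
        print('skipping', target_dir, '(already downloaded)')
        return
    downloader.download(target_dir)
```

With this, each line above becomes e.g. `download_if_missing(dl_swc, os.path.join(download_dir, 'swc'))`.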
Another example that loads, merges, cleans (using the text_cleaning module of this project), splits, and writes the datasets:
from audiomate.corpus import Corpus
from audiomate.corpus.subset import Splitter
from audiomate.corpus import io
from audiomate.corpus import LL_WORD_TRANSCRIPT
import os
import text_cleaning
corpora = list()
download_dir = '/path/to/download/dir'
def clean_transcriptions(corpus):
for utterance in corpus.utterances.values():
ll = utterance.label_lists[LL_WORD_TRANSCRIPT]
for label in ll:
label.value = text_cleaning.clean_sentence(label.value)
# loading corpora
corpora.append(Corpus.load(os.path.join(download_dir, 'swc'), reader='swc'))
corpora.append(Corpus.load(os.path.join(download_dir, 'mailabs'), reader='mailabs'))
corpora.append(Corpus.load(os.path.join(download_dir, 'common-voice'), reader='common-voice'))
corpora.append(Corpus.load(os.path.join(download_dir, 'tuda'), reader='tuda'))
corpora.append(Corpus.load(os.path.join(download_dir, 'voxforge'), reader='voxforge'))
# merging and cleaning corpora
merged_corpora = Corpus.merge_corpora(corpora)
clean_transcriptions(merged_corpora)
# splitting merged corpora
splitter = Splitter(merged_corpora, random_seed=42)
splitted_corpora = splitter.split(proportions={
'train': 0.7,
'dev': 0.15,
'test': 0.15
}, separate_issuers=True)
# import the split parts into the merged corpus
merged_corpora.import_subview('train', splitted_corpora['train'])
merged_corpora.import_subview('dev', splitted_corpora['dev'])
merged_corpora.import_subview('test', splitted_corpora['test'])
# write merged corpora in DeepSpeech format
deepspeech_writer = io.MozillaDeepSpeechWriter()
deepspeech_writer.save(merged_corpora, '/path/to/write/corpora')
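The actual `text_cleaning.clean_sentence` is project-specific and not shown here. Purely as an illustration of the kind of transcript normalization such a function performs, a minimal stand-in (my own sketch, not the project's implementation) could look like this:

```python
import re
import string

def clean_sentence(sentence):
    """Minimal stand-in for text_cleaning.clean_sentence: lowercase the
    transcript, drop punctuation and collapse repeated whitespace.
    The real project module may do more (e.g. expand numbers into
    words), so treat this only as an illustration."""
    sentence = sentence.lower()
    # drop ASCII punctuation; German umlauts and ß are left intact
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    # collapse runs of whitespace into single spaces
    return re.sub(r'\s+', ' ', sentence).strip()
```

For example, `clean_sentence('Hallo,  Welt!')` yields `'hallo welt'`.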
Hopefully this helps.
@fabianbusch : The SWC and M-AILABS corpora have been added to the new release model v0.9.0.
Closing the ticket.
Thanks