This repository contains the "Lingala Speech Translation (LiSTra)" dataset presented in the paper "LiSTra, Automatic Speech Translation: English to Lingala Case Study".
For copyright reasons, we are not able to share the audio files; however, we provide the extraction pipeline below. This pipeline can also be applied to other languages of interest.
- Download and rename the source-language audio: 01-Audio download Rename.ipynb
- Download the corresponding text files for the source language: 02-BibleisWebScrapping.ipynb
- Download the corresponding text files for the target language: 03-JwWebscraping.ipynb
- Generate the TextGrid files: 04-webMausWavGenerator.ipynb
- Rename the wav files: 05.Rename_wav_files.ipynb
Once you have built your dataset, you may need to run the following scripts to check for missing files:
- If both folders contain text files: bash check_diff.sh english/ lingala/ false
- If the second folder contains the wav files: bash check_diff.sh english/ wav_verse/ true
- To compare raw_txt with the TextGrid files: bash check_diff_TextGrid.sh english/raw_txt/ english/maus_textgrid/ true
Note: Make sure the first parameter is the text folder and the second is the wav folder. If both folders contain text files, set the last parameter to false.
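For readers who prefer to run the same kind of check from Python, here is a minimal sketch of a folder comparison. It is not the implementation of check_diff.sh; it assumes that files in the two folders are paired by base name (for example `v1.txt` and `v1.wav`), and the function name `compare_folders` and its `second_is_wav` parameter are hypothetical, chosen only to mirror the three arguments used above.

```python
from pathlib import Path


def compare_folders(first_dir, second_dir, second_is_wav=False):
    """Compare two folders by file base name (assumed pairing rule).

    first_dir is expected to hold .txt files; second_dir holds .wav files
    when second_is_wav is True, and .txt files otherwise.
    Returns (names only in first_dir, names only in second_dir).
    """
    second_ext = ".wav" if second_is_wav else ".txt"
    # Collect base names without extension from each folder.
    first = {p.stem for p in Path(first_dir).glob("*.txt")}
    second = {p.stem for p in Path(second_dir).glob(f"*{second_ext}")}
    # Set differences give the files that lack a counterpart.
    return sorted(first - second), sorted(second - first)


if __name__ == "__main__":
    # Folder names taken from the commands above; adjust to your layout.
    only_txt, only_wav = compare_folders("english/", "wav_verse/", second_is_wav=True)
    print("Text files without audio:", only_txt)
    print("Audio files without text:", only_wav)
```

Reporting the difference in both directions makes it easier to tell a failed download apart from a renaming mismatch.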
The speech-to-speech retrieval baseline model proposed in the paper is available here.
- MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible
- Yuchen Liu et al. paper
You can contact me at skabenamualu@aimsammi.org