Overview
This repository contains the code and instructions needed to reproduce the dataset splits for "Speech Translation for Code-Switched Speech".
You can create both datasets with the `bash create_datasets.sh` command, following the steps in the Instructions section. The `fisher` and `miami` directories contain the scripts that `bash create_datasets.sh` needs for each dataset.
A mapping between the original data and the new code-switched and monolingual splits used in the paper can be found in `mapping_files`. Note that running `bash create_datasets.sh` will create these mappings.
Instructions
- Install the prerequisite libraries for Linux/macOS. This includes `ffmpeg`, `sox`, `wget`, and `python` (e.g. `apt-get install sox`).
- Run `pip install -r requirement.txt` to set up the Python environment.
- Collect the data needed for the Fisher corpus (LDC2010T04 and LDC2010S01) and export their paths: `export LDC2010S01={path_to_LDC2010S01}` and `export LDC2010T04={path_to_LDC2010T04}/fisher_spa_tr`.
- Run `bash create_datasets.sh` to generate both the Miami and Fisher datasets.
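Before launching the script, it can help to confirm that the prerequisites from the steps above are in place. The following is a minimal sketch, not part of this repository: it assumes the tools must be discoverable on `PATH` and that both `LDC2010S01` and `LDC2010T04` should point to existing directories. The function names (`missing_tools`, `bad_env_dirs`, `preflight`) are hypothetical.

```python
import os
import shutil

# Tools named in the prerequisites step.
REQUIRED_TOOLS = ["ffmpeg", "sox", "wget"]
# Environment variables named in the Fisher export step.
REQUIRED_ENV_DIRS = ["LDC2010S01", "LDC2010T04"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def bad_env_dirs(names, environ=None):
    """Return the variables that are unset or do not point to a directory."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not os.path.isdir(environ.get(name, ""))]


def preflight():
    """Collect human-readable problems; an empty list means nothing was flagged."""
    problems = [f"missing tool: {tool}" for tool in missing_tools(REQUIRED_TOOLS)]
    problems += [
        f"unset or not a directory: ${name}"
        for name in bad_env_dirs(REQUIRED_ENV_DIRS)
    ]
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```

If the script prints nothing, none of these checks failed; it does not validate the contents of the LDC directories.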
Example
Example utterance:
- (Audio clip)
- Transcript (code-switched): y ti bueno tiene dos papás which can be a little can be a little challenging.
- Translation (English only): and she has two fathers which can be a little, can be a little challenging.
The data files are composed of three parts:
- The transcript for the dataset split (in `{dataset_name}.translation`)
- The translation for the dataset split (in `{dataset_name}.translation`)
- The audio for the dataset split (in `{dataset_name}.yaml` and `{dataset_name}/clips/*.wav` or `{dataset_name}/clips.zip`)
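As a rough illustration of this layout, the sketch below builds the expected paths for one split and reports which parts are missing. It is an assumption-laden example rather than repository code: it assumes all files for a split sit under a single output directory (`root`, hypothetical), it uses only the file names listed above, and the helper names and dictionary keys are made up for illustration. It does not parse any of the files.

```python
from pathlib import Path


def split_files(root, dataset_name):
    """Map each listed part of a split to its expected path under root."""
    root = Path(root)
    return {
        "translation": root / f"{dataset_name}.translation",
        "audio_index": root / f"{dataset_name}.yaml",
        "clips_dir": root / dataset_name / "clips",
        "clips_zip": root / dataset_name / "clips.zip",
    }


def check_split(root, dataset_name):
    """Return the names of the listed parts that are missing on disk."""
    paths = split_files(root, dataset_name)
    missing = [
        key for key in ("translation", "audio_index") if not paths[key].is_file()
    ]
    # Audio counts as present if there is at least one .wav clip or a clips.zip.
    has_wavs = paths["clips_dir"].is_dir() and any(paths["clips_dir"].glob("*.wav"))
    if not has_wavs and not paths["clips_zip"].is_file():
        missing.append("clips")
    return missing
```

For example, `check_split("output", "some_split")` returns an empty list when every listed part is found under the hypothetical `output` directory.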
Citation
If you found this repository helpful in your research, please consider citing:
Orion Weller, Matthias Sperber, Telmo Pessoa Pires, Hendra Setiawan, Christian Gollan, Dominic Telaar, Matthias Paulik: End-to-End Speech Translation for Code Switched Speech (Findings of the Association for Computational Linguistics: ACL 2022)