The MineTrans Systems for IWSLT 2023 Offline Speech Translation and Speech-to-Speech Translation Tasks
This repository is the official implementation of the MineTrans English-to-Chinese speech translation systems for the IWSLT 2023 speech-to-speech translation (S2ST) track and the offline speech translation (S2T) track.
🌐 Demo Page • 🤗 HuggingFace Page (coming soon) • 📃 Paper • 📽️ Slides • ⏬ Data • 🤖 Model
Team: Yichao Du, Zhengsheng Guo, Jinchuan Tian, Zhirui Zhang, Xing Wang, Jianwei Yu, Zhaopeng Tu, Tong Xu, and Enhong Chen
git clone https://github.com/duyichao/MineTrans-IWSLT23.git
cd MineTrans-IWSLT23
pip install -e ./fairseq
pip install -r requirements.txt
Language | Speech Encoder | Block type | Model size | Dataset | KM-Model |
---|---|---|---|---|---|
En | Wav2vec 2.0 | Conformer | Large | Voxpopuli & GigaSS | × |
Zh | HuBERT | Transformer | Base | GigaSS & AISHELL3 | layer6.km250 |
Models | ASR-BLEU | ASR-chrF | Checkpoint |
---|---|---|---|
W2V2-CONF-LARGE | 27.7 | 23.4 | download |
W2V2-CONF-LARGE+T2U | 27.8 | 23.7 | download |
HUBERT-TRANS-LARGE+T2U | 26.2 | 23.2 | download |
HUBERT-TRANS-LARGE+T2U* | 25.7 | 22.6 | download |
Unit config | Unit size | Language | Dataset | Model |
---|---|---|---|---|
HuBERT Base, layer 6 | 250 | Zh | GigaSS-S (200h) | d_500000 |
The dataset should be prepared as a tab-separated manifest in the following format.
id audio n_frames tgt_text tgt_n_frames
YOU0000010267_S0001707 /path/to/YOU0000010267_S0001707.wav 49600 44 127 27 66 46 100
YOU0000016336_S0001298 /path/to/YOU0000016336_S0001298.wav 83200 44 239 222 46 202
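A manifest in the format above can be generated with a short script. The sketch below is illustrative, not part of the MineTrans codebase; it assumes that `tgt_text` holds the target discrete unit sequence and that `tgt_n_frames` is its length — adjust the column semantics to match your actual data if they differ.

```python
import csv

# Hypothetical example entry; replace with your real data.
samples = [
    {"id": "YOU0000010267_S0001707",
     "audio": "/path/to/YOU0000010267_S0001707.wav",
     "n_frames": 49600,
     "tgt_text": "44 127 27 66 46 100"},
]

def write_manifest(samples, out_path):
    """Write a fairseq-S2T style TSV manifest with the columns shown above."""
    fields = ["id", "audio", "n_frames", "tgt_text", "tgt_n_frames"]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        for s in samples:
            row = dict(s)
            # Assumption: tgt_n_frames = number of target units in tgt_text.
            row.setdefault("tgt_n_frames", len(s["tgt_text"].split()))
            writer.writerow(row)

write_manifest(samples, "train.tsv")
```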
- Follow the same inference process as in fairseq-S2T to generate units (${RESULTS_PATH}/generate-${GEN_SUBSET}.txt).
CFG=config_u250_s2ut_audio.yaml
CKPT_S2UT=/path/to/checkpoint
RESULTS_PATH=/path/to/results
EVAL_DATA_PATH=/path/to/eval_data
GEN_SUBSET=test # name of the evaluation TSV manifest (without extension), e.g. "test" for test.tsv
mkdir -p ${RESULTS_PATH}
CUDA_VISIBLE_DEVICES=1 \
fairseq-generate ${EVAL_DATA_PATH} \
--config-yaml ${CFG} \
--task speech_to_text \
--path ${CKPT_S2UT} --gen-subset ${GEN_SUBSET} \
--max-tokens 2000000 --max-source-positions 2000000 --max-target-positions 10000 \
--beam 10 --max-len-a 1 --max-len-b 200 --lenpen 1 \
--scoring sacrebleu \
--required-batch-size-multiple 1 \
--results-path ${RESULTS_PATH}
- Convert the unit sequences to waveforms with the unit-based HiFi-GAN vocoder.
VOCODER_CFG=/path/to/vocoder_cfg
VOCODER_CKPT=/path/to/vocoder_ckpt
grep "^D\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt |
sed 's/^D-//ig' | sort -nk1 | cut -f3 \
>${RESULTS_PATH}/generate-${GEN_SUBSET}.hyp.unit
grep "^T\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt |
sed 's/^T-//ig' | sort -nk1 | cut -f2 \
>${RESULTS_PATH}/generate-${GEN_SUBSET}.ref.unit
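For reference, the grep/sed/sort/cut pipeline above can be expressed as a small Python helper. This is an illustrative equivalent, not part of the repository; it assumes the standard fairseq generate-*.txt layout, where `D-<idx>` lines carry a score and the hypothesis tokens and `T-<idx>` lines carry only the reference tokens.

```python
import re

def extract_units(generate_txt, prefix):
    """Collect lines starting with e.g. 'D-' (hypotheses) or 'T-' (references)
    from fairseq generate-*.txt content, sorted by sample index."""
    rows = []
    for line in generate_txt.splitlines():
        m = re.match(rf"^{prefix}-(\d+)\t(.*)$", line)
        if m:
            rows.append((int(m.group(1)), m.group(2).split("\t")))
    rows.sort()
    # The token sequence is always the last tab-separated field.
    return [fields[-1] for _, fields in rows]
```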
mkdir -p ${RESULTS_PATH}/audio_gen
python3 ./minetrans/scripts/generate_waveform_from_code.py \
--in-code-file ${RESULTS_PATH}/generate-${GEN_SUBSET}.hyp.unit \
--vocoder ${VOCODER_CKPT} --vocoder-cfg ${VOCODER_CFG} \
--results-path ${RESULTS_PATH}/audio_gen --dur-prediction
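The results table above reports ASR-chrF, which is computed with sacrebleu on ASR transcripts of the generated audio. As a rough illustration of what the metric measures, here is a toy character n-gram F-score in plain Python; it is not the official sacrebleu implementation (no whitespace-handling options, no word n-grams, single reference only) and should not be used to reproduce the reported numbers.

```python
from collections import Counter

def chrf(hyp, ref, n=6, beta=2.0):
    """Minimal chrF sketch: average character n-gram precision/recall
    over orders 1..n, combined into an F-beta score scaled to 0-100."""
    def ngrams(s, k):
        s = s.replace(" ", "")
        return Counter(s[i:i + k] for i in range(len(s) - k + 1))
    precisions, recalls = [], []
    for k in range(1, n + 1):
        h, r = ngrams(hyp, k), ngrams(ref, k)
        overlap = sum((h & r).values())
        precisions.append(overlap / max(sum(h.values()), 1))
        recalls.append(overlap / max(sum(r.values()), 1))
    p = sum(precisions) / n
    rcl = sum(recalls) / n
    if p + rcl == 0:
        return 0.0
    return (1 + beta ** 2) * p * rcl / (beta ** 2 * p + rcl) * 100
```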
Coming soon.
Please cite our paper if you find this repository helpful in your research:
@inproceedings{du2023minetrans,
title = {The {M}ine{T}rans Systems for {IWSLT} 2023 Offline Speech Translation and Speech-to-Speech Translation Tasks},
author = {Du, Yichao and Guo, Zhengsheng and Tian, Jinchuan and Zhang, Zhirui and Wang, Xing and Yu, Jianwei and Tu, Zhaopeng and Xu, Tong and Chen, Enhong},
booktitle = {Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)},
year = {2023},
publisher = {Association for Computational Linguistics},
url = {https://aclanthology.org/2023.iwslt-1.3},
pages = {79--88},
}