The MineTrans Systems for IWSLT 2023 Offline Speech Translation and Speech-to-Speech Translation Tasks

This project is the official implementation of the MineTrans English-to-Chinese speech translation system for the IWSLT 2023 speech-to-speech translation (S2ST) track and the offline speech translation (S2T) track.

🌐 Demo Page • 🤗 HuggingFace Page (coming soon) • 📃 Paper • 📽️ Slides • ⏬ Data • 🤖 Model

Team: Yichao Du, Zhengsheng Guo, Jinchuan Tian, Zhirui Zhang, Xing Wang, Jianwei Yu, Zhaopeng Tu, Tong Xu, and Enhong Chen


Overview


Setup

git clone https://github.com/duyichao/MineTrans-IWSLT23.git
cd MineTrans-IWSLT23
pip install -e ./fairseq
pip install -r requirements.txt
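
Optionally, you can verify that the editable fairseq install is importable and that a GPU is visible to PyTorch. This is a quick sanity check, not part of the official setup:

import fairseq
import torch

# Print versions and GPU visibility to confirm the environment is usable.
print("fairseq:", fairseq.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())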

Speech-to-Speech Translation

Pre-trained Models

Speech Encoder & K-means Model

| Language | Speech Encoder | Block type | Model size | Dataset | KM-Model |
| --- | --- | --- | --- | --- | --- |
| En | Wav2vec 2.0 | Conformer | Large | VoxPopuli & GigaSS | × |
| Zh | HuBERT | Transformer | Base | GigaSS & AISHELL-3 | layer6.km250 |
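
The Chinese target units are obtained by quantizing HuBERT layer-6 features with the released layer6.km250 k-means model. The snippet below sketches only the quantization step and assumes the features have already been extracted (e.g., following fairseq's speech-to-unit recipes); the file names, the joblib/scikit-learn format of the k-means checkpoint, and the de-duplication step are assumptions, not the official pipeline.

import joblib
import numpy as np

# Assumed: the k-means checkpoint is a scikit-learn KMeans object saved with joblib.
km_model = joblib.load("layer6.km250")
# Hypothetical (T, D) matrix of HuBERT layer-6 features for one utterance.
feats = np.load("/path/to/layer6_feats.npy")

units = km_model.predict(feats)  # one cluster id in [0, 249] per frame
# Collapsing consecutive duplicates gives a reduced unit sequence (illustrative).
reduced = [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(" ".join(map(str, reduced)))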

S2UT Model

| Models | ASR-BLEU | ASR-charF | Checkpoint |
| --- | --- | --- | --- |
| W2V2-CONF-LARGE | 27.7 | 23.4 | download |
| W2V2-CONF-LARGE+T2U | 27.8 | 23.7 | download |
| HUBERT-TRANS-LARGE+T2U | 26.2 | 23.2 | download |
| HUBERT-TRANS-LARGE+T2U* | 25.7 | 22.6 | download |

Unit HiFi-GAN Vocoder

| Unit config | Unit size | Language | Dataset | Model |
| --- | --- | --- | --- | --- |
| HuBERT Base, layer 6 | 250 | Zh | GigaSS-S (200h) | d_500000 |

Data Preparation

Formatting Data

The dataset should be prepared in the following format:

id	audio	n_frames	tgt_text	tgt_n_frames
YOU0000010267_S0001707	/path/to/YOU0000010267_S0001707.wav	49600	44 127 27 66 46	100
YOU0000016336_S0001298	/path/to/YOU0000016336_S0001298.wav	83200	44 239 222 46	202
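
A minimal sketch for producing such a manifest is shown below. The column semantics are inferred from the example: n_frames is taken as the number of samples in the source wav, while tgt_text and tgt_n_frames (which need not equal the token count of tgt_text) are assumed to come from your own target-unit extraction.

import csv
import soundfile as sf

# Hypothetical list of (utt_id, wav_path, target_unit_string, tgt_n_frames),
# filled from your own data and unit extraction.
utterances = [
    ("YOU0000010267_S0001707", "/path/to/YOU0000010267_S0001707.wav", "44 127 27 66 46", 100),
]

with open("train.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["id", "audio", "n_frames", "tgt_text", "tgt_n_frames"])
    for utt_id, wav_path, units, tgt_n_frames in utterances:
        n_frames = sf.info(wav_path).frames  # number of audio samples in the wav
        writer.writerow([utt_id, wav_path, n_frames, units, tgt_n_frames])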

Inference

  1. Follow the same inference process as in fairseq-S2T to generate units (${RESULTS_PATH}/generate-${GEN_SUBSET}.txt).
CFG=config_u250_s2ut_audio.yaml
CKPT_S2UT=/path/to/checkpoint
RESULTS_PATH=/path/to/results
EVAL_DATA_PATH=/path/to/eval_data
GEN_SUBSET=test  # subset name (matching the tsv split under ${EVAL_DATA_PATH}), not a path


mkdir -p ${RESULTS_PATH}
CUDA_VISIBLE_DEVICES=1 \
  fairseq-generate ${EVAL_DATA_PATH} \
  --config-yaml ${CFG} \
  --task speech_to_text \
  --path ${CKPT_S2UT} --gen-subset ${GEN_SUBSET} \
  --max-tokens 2000000 --max-source-positions 2000000 --max-target-positions 10000 \
  --beam 10 --max-len-a 1 --max-len-b 200 --lenpen 1 \
  --scoring sacrebleu \
  --required-batch-size-multiple 1 \
  --results-path ${RESULTS_PATH}
  2. Convert unit sequences to waveforms with the unit-based HiFi-GAN vocoder.
VOCODER_CFG=/path/to/vocoder_cfg
VOCODER_CKPT=/path/to/vocoder_ckpt
  grep "^D\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt |
    sed 's/^D-//ig' | sort -nk1 | cut -f3 \
      >${RESULTS_PATH}/generate-${GEN_SUBSET}.hyp.unit
  grep "^T\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt |
    sed 's/^T-//ig' | sort -nk1 | cut -f2 \
      >${RESULTS_PATH}/generate-${GEN_SUBSET}.ref.unit

  mkdir -p ${RESULTS_PATH}/audio_gen
  python3 ./minetrans/scripts/generate_waveform_from_code.py \
    --in-code-file ${RESULTS_PATH}/generate-${GEN_SUBSET}.hyp.unit \
    --vocoder ${VOCODER_CKPT} --vocoder-cfg ${VOCODER_CFG} \
    --results-path ${RESULTS_PATH}/audio_gen --dur-prediction
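
As an optional sanity check (an illustrative sketch, assuming the vocoder writes one .wav file per hypothesis into ${RESULTS_PATH}/audio_gen), you can count the generated waveforms and report their total duration:

import glob
import soundfile as sf

audio_dir = "/path/to/results/audio_gen"  # ${RESULTS_PATH}/audio_gen
wav_paths = sorted(glob.glob(f"{audio_dir}/*.wav"))

# Sum durations (in seconds) over all generated waveforms.
total_sec = sum(sf.info(p).duration for p in wav_paths)
if wav_paths:
    print(f"{len(wav_paths)} waveforms, {total_sec / 60:.1f} min total, "
          f"{sf.info(wav_paths[0]).samplerate} Hz")
else:
    print("No waveforms found; check ${RESULTS_PATH}/audio_gen")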

Offline Speech Translation

Coming soon.


Citation

Please cite our paper if you find this repository helpful in your research:

@inproceedings{du2023minetrans,
    title = {The {M}ine{T}rans Systems for {IWSLT} 2023 Offline Speech Translation and Speech-to-Speech Translation Tasks},
    author = {Du, Yichao and Guo, Zhengsheng and Tian, Jinchuan and Zhang, Zhirui and Wang, Xing and Yu, Jianwei and Tu, Zhaopeng and Xu, Tong and Chen, Enhong},
    booktitle = {Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)},
    year = {2023},
    publisher = {Association for Computational Linguistics},
    url = {https://aclanthology.org/2023.iwslt-1.3},
    pages = {79--88},
}