We present an example of French-to-English translation with chunk sizes of 320 ms and 2560 ms, as well as in the offline condition.
More examples can be found at https://nast-s2x.github.io/.
Chunk Size 320ms | Chunk Size 2560ms | Offline |
---|---|---|
CS_320ms.mp4 | CS_2560ms.mp4 | Offline.mp4 |
Source Speech Transcript | Reference Text Translation |
---|---|
Avant la fusion des communes, Rouge-Thier faisait partie de la commune de Louveigné. | before the fusion of the towns rouge thier was a part of the town of louveigne |
- (2024/06/27) We have created a tutorial to guide you through preprocessing the data and running NAST-S2X on your machine. You can find it at this URL.
- We have published our paper on arXiv, available at https://arxiv.org/abs/2406.06937.
- We have released the checkpoints and datasets for reference on Hugging Face 🤗.
- 🤖 An end-to-end model without intermediate text decoding
- 💪 Supports offline and streaming decoding of all modalities
- ⚡️ 28× faster inference compared to autoregressive models
- ⚡️ Lightning Fast: 28× faster inference and competitive quality in offline speech-to-speech translation
- 👩‍💼 Simultaneous: Achieves high-quality simultaneous interpretation within a delay of less than 3 seconds
- 🤖 Unified Framework: Supports end-to-end text & speech generation in one model
Check Details 👇
Offline-S2S | Simul-S2S | Simul-S2T |
---|---|---|
- Fully Non-autoregressive: Trained with CTC-based non-monotonic latent alignment loss (Shao and Feng, 2022) and glancing mechanism (Qian et al., 2021).
- Minimum Human Design: Seamlessly switch between offline translation and simultaneous interpretation by adjusting the chunk size.
- End-to-End: Generate target speech without target text decoding.
Note
We release French-to-English speech-to-speech translation models trained on the CVSS-C dataset to reproduce the results in our paper. You can train models for your desired languages by following the instructions provided below.
Chunk Size | Checkpoint | ASR-BLEU | ASR-BLEU (Silence Removed) | Average Lagging |
---|---|---|---|---|
320ms | checkpoint | 19.67 | 24.90 | -393ms |
1280ms | checkpoint | 20.20 | 25.71 | 3330ms |
2560ms | checkpoint | 24.88 | 26.14 | 4976ms |
Offline | checkpoint | 25.82 | - | - |
Vocoder |
---|
checkpoint |
Warning
Before executing the provided shell scripts, make sure to replace the variables in each file with paths specific to your machine.
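As a rough illustration, the variables to edit usually sit at the top of each script and point at your local data and checkpoint paths; the variable names below are hypothetical placeholders, not the repository's actual names.

```bash
# Hypothetical example of the path variables to set at the top of each
# script; the actual variable names in this repository's scripts may differ.
DATA_ROOT=/path/to/cvss-c/fr-en       # preprocessed CVSS-C data
CKPT_DIR=/path/to/checkpoints         # downloaded model checkpoints
VOCODER_CKPT=/path/to/vocoder.pt      # vocoder checkpoint
OUT_DIR=/path/to/output               # generated units / waveforms
```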
- Data preprocessing: Follow the instructions in the document.
- Generate Acoustic Unit: Execute `offline_s2u_infer.sh`
- Generate Waveform: Execute `offline_wav_infer.sh`
- Evaluation: Use Fairseq's ASR-BLEU evaluation toolkit (see the sketch below).
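Putting these steps together, an offline inference run could look like the following minimal sketch; the two script names come from this repository, while the evaluation comment only points at Fairseq's toolkit rather than giving an exact command.

```bash
# Offline speech-to-speech inference, end to end.
# Assumes the path variables inside each script are already set.
bash offline_s2u_infer.sh    # speech -> discrete acoustic units
bash offline_wav_infer.sh    # acoustic units -> waveform via the vocoder

# For evaluation, transcribe the generated audio with ASR and compute BLEU
# against the reference translations using Fairseq's ASR-BLEU toolkit
# (examples/speech_to_speech/asr_bleu in the Fairseq repository).
```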
- We use our customized fork of SimulEval (commit `b43a7c`) to evaluate the model in simultaneous inference. This fork is built upon the official SimulEval (commit `a1435b`) and includes additional latency scorers.
- Data preprocessing: Follow the instructions in the document.
- Streaming Generation and Evaluation: Execute `streaming_infer.sh` (see the setup sketch below).
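A minimal environment sketch for simultaneous evaluation, assuming you install our SimulEval fork before running the script; the clone URL below is a hypothetical placeholder, while the commit hash is the one listed above.

```bash
# Install the customized SimulEval fork pinned to the commit noted above.
# NOTE: the clone URL is a placeholder; use the fork linked in this repo.
git clone https://github.com/example/SimulEval.git
cd SimulEval
git checkout b43a7c          # customized fork commit referenced above
pip install -e .
cd ..

bash streaming_infer.sh      # streaming generation and evaluation
```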
In general, use the following CLI flags for each training stage.
Training Stage | CLI Commands |
---|---|
ASR Pretrain | --arch nonautoregressive_streaming_speech_transformer_segment_to_segment --task nat_speech_to_text_ctc_modified --criterion nat_loss_ngram_glat_asr |
Speech-to-Unit Training | --arch nonautoregressive_streaming_speech_to_unit_transformer_segment_to_segment --task nat_speech_to_unit_ctc_modified --criterion nat_loss_ngram_glat_s2u |
Speech-to-Text Training | --arch nonautoregressive_streaming_speech_transformer_segment_to_segment --task nat_speech_to_text_ctc_modified --criterion nat_loss_ngram_glat |
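As an example, a speech-to-unit training run could be assembled as below; only the `--arch`, `--task`, and `--criterion` values are taken from the table, while the data path and optimization flags are illustrative placeholders — take the real hyperparameters from the released training scripts.

```bash
# Illustrative fairseq-train invocation for the speech-to-unit stage.
# Only --arch/--task/--criterion come from the table above; every other
# flag and path is a placeholder — see the released training scripts for
# the actual settings.
fairseq-train /path/to/preprocessed/data \
  --arch nonautoregressive_streaming_speech_to_unit_transformer_segment_to_segment \
  --task nat_speech_to_unit_ctc_modified \
  --criterion nat_loss_ngram_glat_s2u \
  --save-dir /path/to/checkpoints \
  --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt \
  --warmup-updates 4000 --max-tokens 40000
```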
The detailed training scripts are provided for your reference.
- Data preprocessing: Follow the instructions in the document.
- Encoder Pretraining: Execute `pretrain_encoder.sh`
- CTC Pretraining: Execute `train_ctc.sh`
- NMLA Training: Execute `train_nmla.sh` (the full sequence is sketched below)
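The three stages run in order, and each later stage is expected to initialize from the previous stage's checkpoint (set the corresponding path variables inside each script first); the script names are the ones listed above.

```bash
# Full training pipeline, run stage by stage.
bash pretrain_encoder.sh   # stage 1: encoder pretraining (ASR)
bash train_ctc.sh          # stage 2: CTC pretraining
bash train_nmla.sh         # stage 3: NMLA training
```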
Please cite us if you find our papers or code useful.
```
@inproceedings{ma2024nonautoregressive,
  title={A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation},
  author={Ma, Zhengrui and Fang, Qingkai and Zhang, Shaolei and Guo, Shoutao and Feng, Yang and Zhang, Min},
  booktitle={Proceedings of ACL 2024},
  year={2024}
}
```
```
@inproceedings{fang2024ctcs2ut,
  title={CTC-based Non-autoregressive Textless Speech-to-Speech Translation},
  author={Fang, Qingkai and Ma, Zhengrui and Zhou, Yan and Zhang, Min and Feng, Yang},
  booktitle={Findings of ACL 2024},
  year={2024}
}
```