We present an example of French-to-English translation with chunk sizes of 320 ms and 2560 ms, as well as in the offline condition.
More examples can be found at https://nast-s2x.github.io/.
Chunk Size 320ms | Chunk Size 2560ms | Offline |
---|---|---|
CS_320ms.mp4 | CS_2560ms.mp4 | Offline.mp4 |
Source Speech Transcript | Reference Text Translation |
---|---|
Avant la fusion des communes, Rouge-Thier faisait partie de la commune de Louveigné. | before the fusion of the towns rouge thier was a part of the town of louveigne |
- (2024/06/27) We have created a tutorial to guide you through preprocessing the data and running NAST-S2X on your machine. You can find it at this URL.
- We have published our paper on arXiv, available at https://arxiv.org/abs/2406.06937.
- We have released the checkpoints and datasets for reference on Hugging Face 🤗.
- 🤖 An end-to-end model without intermediate text decoding
- 💪 Supports offline and streaming decoding of all modalities
- ⚡️ 28× faster inference compared to autoregressive models
- ⚡️ Lightning Fast: 28× faster inference and competitive quality in offline speech-to-speech translation
- 👩‍💼 Simultaneous: Achieves high-quality simultaneous interpretation within a delay of less than 3 seconds
- 🤖 Unified Framework: Supports end-to-end text & speech generation in one model
Check Details 👇
Offline-S2S | Simul-S2S | Simul-S2T |
---|---|---|
- Fully Non-autoregressive: Trained with CTC-based non-monotonic latent alignment loss (Shao and Feng, 2022) and glancing mechanism (Qian et al., 2021).
- Minimum Human Design: Seamlessly switch between offline translation and simultaneous interpretation by adjusting the chunk size.
- End-to-End: Generate target speech without target text decoding.
Note
We release French-to-English speech-to-speech translation models trained on the CVSS-C dataset to reproduce the results in our paper. You can train models for your desired languages by following the instructions provided below.
Chunk Size | Checkpoint | ASR-BLEU | ASR-BLEU (Silence Removed) | Average Lagging |
---|---|---|---|---|
320ms | checkpoint | 19.67 | 24.90 | -393ms |
1280ms | checkpoint | 20.20 | 25.71 | 3330ms |
2560ms | checkpoint | 24.88 | 26.14 | 4976ms |
Offline | checkpoint | 25.82 | - | - |
Vocoder |
---|
checkpoint |
Warning
Before executing the provided shell scripts, make sure to replace the variables in each file with paths specific to your machine.
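As a rough illustration, the variables to edit usually sit at the top of each script and point at your local data and checkpoint paths; the variable names below are hypothetical placeholders, not the repository's actual names.

```bash
# Hypothetical example of the path variables to set at the top of each
# script; the actual variable names in this repository's scripts may differ.
DATA_ROOT=/path/to/cvss-c/fr-en       # preprocessed CVSS-C data
CKPT_DIR=/path/to/checkpoints         # downloaded model checkpoints
VOCODER_CKPT=/path/to/vocoder.pt      # vocoder checkpoint
OUT_DIR=/path/to/output               # generated units / waveforms
```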
- Data preprocessing: Follow the instructions in the document.
- Generate Acoustic Unit: Execute `offline_s2u_infer.sh`
- Generate Waveform: Execute `offline_wav_infer.sh`
- Evaluation: Use Fairseq's ASR-BLEU evaluation toolkit (see the sketch below).
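Putting these steps together, an offline inference run could look like the following minimal sketch; the two script names come from this repository, while the evaluation comment only points at Fairseq's toolkit rather than giving an exact command.

```bash
# Offline speech-to-speech inference, end to end.
# Assumes the path variables inside each script are already set.
bash offline_s2u_infer.sh    # speech -> discrete acoustic units
bash offline_wav_infer.sh    # acoustic units -> waveform via the vocoder

# For evaluation, transcribe the generated audio with ASR and compute BLEU
# against the reference translations using Fairseq's ASR-BLEU toolkit
# (examples/speech_to_speech/asr_bleu in the Fairseq repository).
```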
- We use our customized fork of SimulEval (commit `b43a7c`) to evaluate the model in simultaneous inference. This fork is built upon the official SimulEval (commit `a1435b`) and includes additional latency scorers.
- Data preprocessing: Follow the instructions in the document.
- Streaming Generation and Evaluation: Execute `streaming_infer.sh` (see the setup sketch below).
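A minimal environment sketch for simultaneous evaluation, assuming you install our SimulEval fork before running the script; the clone URL below is a hypothetical placeholder, while the commit hash is the one listed above.

```bash
# Install the customized SimulEval fork pinned to the commit noted above.
# NOTE: the clone URL is a placeholder; use the fork linked in this repo.
git clone https://github.com/example/SimulEval.git
cd SimulEval
git checkout b43a7c          # customized fork commit referenced above
pip install -e .
cd ..

bash streaming_infer.sh      # streaming generation and evaluation
```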
In general, use the following CLI flags for each training stage.
Training Stage | CLI Commands |
---|---|
ASR Pretrain | --arch nonautoregressive_streaming_speech_transformer_segment_to_segment --task nat_speech_to_text_ctc_modified --criterion nat_loss_ngram_glat_asr |
Speech-to-Unit Training | --arch nonautoregressive_streaming_speech_to_unit_transformer_segment_to_segment --task nat_speech_to_unit_ctc_modified --criterion nat_loss_ngram_glat_s2u |
Speech-to-Text Training | --arch nonautoregressive_streaming_speech_transformer_segment_to_segment --task nat_speech_to_text_ctc_modified --criterion nat_loss_ngram_glat |
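As an example, a speech-to-unit training run could be assembled as below; only the `--arch`, `--task`, and `--criterion` values are taken from the table, while the data path and optimization flags are illustrative placeholders — take the real hyperparameters from the released training scripts.

```bash
# Illustrative fairseq-train invocation for the speech-to-unit stage.
# Only --arch/--task/--criterion come from the table above; every other
# flag and path is a placeholder — see the released training scripts for
# the actual settings.
fairseq-train /path/to/preprocessed/data \
  --arch nonautoregressive_streaming_speech_to_unit_transformer_segment_to_segment \
  --task nat_speech_to_unit_ctc_modified \
  --criterion nat_loss_ngram_glat_s2u \
  --save-dir /path/to/checkpoints \
  --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt \
  --warmup-updates 4000 --max-tokens 40000
```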
The detailed training scripts are provided for your reference.
- Data preprocessing: Follow the instructions in the document.
- Encoder Pretraining: Execute `pretrain_encoder.sh`
- CTC Pretraining: Execute `train_ctc.sh`
- NMLA Training: Execute `train_nmla.sh` (the full sequence is sketched below)
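The three stages run in order, and each later stage is expected to initialize from the previous stage's checkpoint (set the corresponding path variables inside each script first); the script names are the ones listed above.

```bash
# Full training pipeline, run stage by stage.
bash pretrain_encoder.sh   # stage 1: encoder pretraining (ASR)
bash train_ctc.sh          # stage 2: CTC pretraining
bash train_nmla.sh         # stage 3: NMLA training
```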
Please cite us if you find our papers or code useful.
```
@inproceedings{ma2024nonautoregressive,
  title={A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation},
  author={Ma, Zhengrui and Fang, Qingkai and Zhang, Shaolei and Guo, Shoutao and Feng, Yang and Zhang, Min},
  booktitle={Proceedings of ACL 2024},
  year={2024}
}
```
```
@inproceedings{fang2024ctcs2ut,
  title={CTC-based Non-autoregressive Textless Speech-to-Speech Translation},
  author={Fang, Qingkai and Ma, Zhengrui and Zhou, Yan and Zhang, Min and Feng, Yang},
  booktitle={Findings of ACL 2024},
  year={2024}
}
```