NAST-S2X: A Fast and End-to-End Simultaneous Speech-to-Any Translation Model


Speech-to-Speech Demo

We present an example of French-to-English translation using chunk sizes of 320 ms and 2560 ms, as well as under offline conditions.


| Chunk Size 320ms | Chunk Size 2560ms | Offline |
| --- | --- | --- |
| CS_320ms.mp4 | CS_2560ms.mp4 | Offline.mp4 |

| Source Speech Transcript | Reference Text Translation |
| --- | --- |
| Avant la fusion des communes, Rouge-Thier faisait partie de la commune de Louveigné. | before the fusion of the towns rouge thier was a part of the town of louveigne |

Note

For more examples, please check https://nast-s2x.github.io/.

News🔥

  • (2024/06/27) We have created a tutorial to guide you through preprocessing the data and running NAST-S2X on your machine. You can find it at this URL.
  • We have published our paper on arXiv, available at https://arxiv.org/abs/2406.06937.
  • We have released the checkpoints and datasets for reference at Hugging Face🤗.

Features

  • 🤖 An end-to-end model without intermediate text decoding
  • 💪 Supports offline and streaming decoding of all modalities
  • ⚡️ 28× faster inference compared to autoregressive models

Performance

  • ⚡️ Lightning Fast: 28× faster inference and competitive quality in offline speech-to-speech translation
  • 👩‍💼 Simultaneous: Achieves high-quality simultaneous interpretation within a delay of less than 3 seconds
  • 🤖 Unified Framework: Supports end-to-end text & speech generation in one model

Check Details 👇

[Result figures: Offline-S2S, Simul-S2S, and Simul-S2T]

Architecture

  • Fully Non-autoregressive: Trained with CTC-based non-monotonic latent alignment loss (Shao and Feng, 2022) and glancing mechanism (Qian et al., 2021).
  • Minimum Human Design: Seamlessly switch between offline translation and simultaneous interpretation by adjusting the chunk size.
  • End-to-End: Generate target speech without target text decoding.

Sources and Usage

Model

Note

We release French-to-English speech-to-speech translation models trained on the CVSS-C dataset to reproduce results in our paper. You can train models in your desired languages by following the instructions provided below.

🤗 Model card

| Chunk Size | Checkpoint | ASR-BLEU | ASR-BLEU (Silence Removed) | Average Lagging |
| --- | --- | --- | --- | --- |
| 320ms | checkpoint | 19.67 | 24.90 | -393ms |
| 1280ms | checkpoint | 20.20 | 25.71 | 3330ms |
| 2560ms | checkpoint | 24.88 | 26.14 | 4976ms |
| Offline | checkpoint | 25.82 | - | - |

Vocoder: checkpoint
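
The released checkpoints can be fetched from the Hugging Face Hub, for example with the huggingface-cli tool. The repository id below is only a placeholder; use the id linked from the model card above.

```bash
# Download a released checkpoint from the Hugging Face Hub.
# "ICTNLP/NAST-S2X" is a placeholder repository id -- replace it with the id
# linked from the model card above.
pip install -U "huggingface_hub[cli]"
huggingface-cli download ICTNLP/NAST-S2X --local-dir ./nast_s2x_checkpoints
```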

Inference

Warning

Before executing any of the provided shell scripts, make sure to replace the variables in each script with paths specific to your machine.
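
As an illustration, the variables at the top of a script typically point to the data, checkpoints, and output locations. The names below are placeholders, not the exact variables defined in the repository's scripts.

```bash
# Placeholder path variables -- open each provided script and edit the
# variables it actually defines to match your machine.
DATA_ROOT=/path/to/cvss-c/fr-en        # preprocessed CVSS-C Fr-En data
CHECKPOINT=/path/to/nast_s2x.pt        # downloaded NAST-S2X checkpoint
VOCODER=/path/to/vocoder_checkpoint    # unit vocoder checkpoint
RESULT_DIR=/path/to/output             # directory for generated translations
```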

Offline Inference

Simultaneous Inference

  • We use our customized fork of SimulEval (commit b43a7c) to evaluate the model in simultaneous inference. The fork is built upon the official SimulEval (commit a1435b) and adds additional latency scorers.
  • Data preprocessing: Follow the instructions in the document.
  • Streaming Generation and Evaluation: Execute streaming_infer.sh; a sketch of the underlying SimulEval call is shown after this list.
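
streaming_infer.sh wraps a SimulEval call. A minimal sketch of such an invocation is shown below; the agent file name, paths, and the exact flag set are assumptions for illustration, and the real command (including the extra latency scorers from our fork) is defined in the script itself.

```bash
# Illustrative SimulEval invocation -- the agent path and flags are assumptions;
# see streaming_infer.sh for the actual command used in our experiments.
#   --agent:               Python agent wrapping the NAST-S2X model (hypothetical path)
#   --source / --target:   lists of source audio files and reference translations
#   --source-segment-size: input chunk size in milliseconds
simuleval \
    --agent agents/nast_s2x_agent.py \
    --source ${DATA_ROOT}/test.source \
    --target ${DATA_ROOT}/test.target \
    --source-segment-size 320 \
    --output ${RESULT_DIR}/simul_320ms
```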

Train your own NAST-S2X

In general, you should use the following CLI arguments for the different training stages.

  • ASR Pretrain:
    --arch nonautoregressive_streaming_speech_transformer_segment_to_segment
    --task nat_speech_to_text_ctc_modified
    --criterion nat_loss_ngram_glat_asr
  • Speech-to-Unit Training:
    --arch nonautoregressive_streaming_speech_to_unit_transformer_segment_to_segment
    --task nat_speech_to_unit_ctc_modified
    --criterion nat_loss_ngram_glat_s2u
  • Speech-to-Text Training:
    --arch nonautoregressive_streaming_speech_transformer_segment_to_segment
    --task nat_speech_to_text_ctc_modified
    --criterion nat_loss_ngram_glat

The detailed training scripts are provided for your reference; a minimal example of how these arguments fit into a fairseq-train command is sketched below.
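
As a concrete illustration of how the arguments above fit together, a speech-to-unit training run would look roughly like the following. The data path, --user-dir location, and the hyperparameters are placeholders; the provided training scripts contain the exact settings used in the paper.

```bash
# Rough sketch of a speech-to-unit training command built from the arguments
# listed above. Paths and hyperparameters are placeholders -- use the provided
# training scripts for the exact configuration.
fairseq-train ${DATA_ROOT} \
    --user-dir /path/to/nast_s2x_plugins \
    --arch nonautoregressive_streaming_speech_to_unit_transformer_segment_to_segment \
    --task nat_speech_to_unit_ctc_modified \
    --criterion nat_loss_ngram_glat_s2u \
    --save-dir ${CHECKPOINT_DIR} \
    --max-tokens 40000 \
    --max-update 300000
```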

Citing

Please cite us if you find our papers or code useful.

@inproceedings{ma2024nonautoregressive,
  title={A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation},
  author={Ma, Zhengrui and Fang, Qingkai and Zhang, Shaolei and Guo, Shoutao and Feng, Yang and Zhang, Min},
  booktitle={Proceedings of ACL 2024},
  year={2024}
}

@inproceedings{fang2024ctcs2ut,
  title={CTC-based Non-autoregressive Textless Speech-to-Speech Translation},
  author={Fang, Qingkai and Ma, Zhengrui and Zhou, Yan and Zhang, Min and Feng, Yang},
  booktitle={Findings of ACL 2024},
  year={2024}
}