PyTorch Implementation of TranSpeech (ICLR'23): Textless NAR Speech-to-Speech Translation with Bilateral Perturbation

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

Rongjie Huang*, Jinglin Liu*, Huadai Liu*, Yi Ren, Lichao Zhang, Jinzheng He, Zhou Zhao | Zhejiang University, ByteDance

PyTorch Implementation of TranSpeech (ICLR'23): a speech-to-speech translation model towards high-accuracy and non-autoregressive translation.

We provide our implementation and pretrained models in this repository.

Visit our demo page for audio samples.

🔥 News

TranSpeech is one of our continuous efforts to reduce communication barriers.

Dependencies

  • PyTorch version >= 1.5.0
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • To install fairseq version 1.0.0a0 and develop locally:
pip install --editable ./
  • For faster training install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
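
Before installing, it can help to confirm that the environment meets the minimum versions listed above. The following is a minimal sketch, not part of this repository; the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the check only looks at Python and PyTorch.

```python
import sys

MIN_PYTHON = (3, 6)      # from the dependency list above
MIN_TORCH = (1, 5, 0)    # from the dependency list above


def parse_version(text):
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1).

    Local build tags after '+' and non-numeric suffixes are ignored,
    and short versions such as '1.5' are padded to three fields.
    """
    core = text.split("+", 1)[0]
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version_text, minimum):
    """True if the parsed version is at least the given minimum tuple."""
    return parse_version(version_text) >= minimum


def check_environment():
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python >= 3.6 is required")
    try:
        import torch  # optional import: only checked if it is installed
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(torch.__version__, MIN_TORCH):
            problems.append("PyTorch >= 1.5.0 is required")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```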

Train your own model

Data preparation

  1. Prepare two folders, $SRC_AUDIO and $TGT_AUDIO, each containing ${SPLIT}/${SAMPLE_ID}.wav, for source and target speech respectively. Note that for S2UT experiments the target audio must be sampled at 16,000 Hz, and for S2SPECT experiments a target sampling rate of 22,050 Hz is recommended.
  2. To prepare target discrete units for S2UT model training, see Generative Spoken Language Modeling (speech2unit) for pre-trained k-means models, checkpoints, and instructions on how to decode units from speech. Set the output target unit files (--out_quantized_file_path) as ${TGT_AUDIO}/${SPLIT}.txt. In Lee et al. 2021, we use 100 units from the sixth layer (--layer 6) of the HuBERT Base model.
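
A quick sanity check on the sampling rate can catch problems before unit extraction. The sketch below is an illustration only and assumes uncompressed PCM .wav files that Python's standard `wave` module can open; the function names are hypothetical and not part of this repository.

```python
import wave
from pathlib import Path


def wav_sample_rate(path):
    """Read the sampling rate (Hz) from a PCM .wav header."""
    with wave.open(str(path), "rb") as handle:
        return handle.getframerate()


def find_rate_mismatches(audio_root, split, expected_hz):
    """List (file name, rate) for files in audio_root/split not at expected_hz."""
    mismatches = []
    for wav_path in sorted(Path(audio_root, split).glob("*.wav")):
        rate = wav_sample_rate(wav_path)
        if rate != expected_hz:
            mismatches.append((wav_path.name, rate))
    return mismatches
```

For example, `find_rate_mismatches(tgt_audio, "train", 16000)` returns an empty list when every target file in the train split is at 16,000 Hz, as expected for S2UT experiments.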

Bilateral Perturbation

1. Prepare a pretrained Hubert and HifiGAN

| Model | Pretraining Data | Model | Quantizer |
|---|---|---|---|
| mHuBERT Base | En, Es, Fr speech | download | L11 km1000 |
| HIFIGAN | 16k Universal | download | |
| dict.unit.txt | | download | |

2. Prepare Perturbed Dataset

Suppose we have original dataset at /path/to/TGT_AUDIO

  • information enhancement: refer to ./hubertCTC/gen_IE.py and generate Dataset S1:
python research/TranSpeech/hubertCTC/gen_IE.py --ckpt /path/to/ckpt --wav /path/to/TGT_AUDIO --out /path/to/S1/dataset
  • style normalization: refer to ./hubertCTC/gen_SN.py and generate Dataset S2:
python research/TranSpeech/hubertCTC/gen_SN.py --wav /path/to/TGT_AUDIO --out /path/to/S2/dataset

3. Prepare Pseudo Text

  • Get Manifest
python examples/wav2vec/wav2vec_manifest.py /path/to/S1/dataset --dest /manifest/to/S1/dataset --ext $ext --valid-percent $valid
python examples/wav2vec/wav2vec_manifest.py /path/to/S2/dataset --dest /manifest/to/S2/dataset --ext $ext --valid-percent $valid

$ext should be set to flac, wav, or whatever format your dataset happens to use that soundfile can read. $valid should be set to some reasonable percentage (like 0.01) of training data to use for validation.
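
The manifests written by wav2vec_manifest.py are plain .tsv files. As far as we know, the first line holds the dataset root and each following line holds a relative audio path and its length in samples, separated by a tab. The sketch below reads such a file under that assumption; `read_manifest` and `total_hours` are hypothetical helpers, and the 16,000 Hz default is an assumption about the audio.

```python
import os


def read_manifest(tsv_path):
    """Return (root, [(audio path joined with root, number of samples), ...])."""
    with open(tsv_path, encoding="utf-8") as handle:
        root = handle.readline().rstrip("\n")  # first line: dataset root
        entries = []
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            rel_path, n_samples = line.split("\t")
            entries.append((os.path.join(root, rel_path), int(n_samples)))
    return root, entries


def total_hours(entries, sample_rate=16000):
    """Total audio duration in hours, assuming a fixed sampling rate."""
    return sum(n for _, n in entries) / sample_rate / 3600
```

Checking the entry count and total duration of train.tsv and valid.tsv is a cheap way to confirm that $valid split the data as intended.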

  • Quantize using the learned clusters
MANIFEST=/manifest/to/S2/dataset
OUT_QUANTIZED_FILE=/quantized/to/S2/dataset
# For CKPT_PATH and KM_MODEL_PATH, refer to step 1 (pretrained mHuBERT and its quantizer).

python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
    --feature_type hubert \
    --kmeans_model_path $KM_MODEL_PATH \
    --acoustic_model_path $CKPT_PATH \
    --layer 11 \
    --manifest_path $MANIFEST  \
    --out_quantized_file_path $OUT_QUANTIZED_FILE \
    --extension ".flac"
  • Prepare {train,valid}.unit
python data/huberts/generate_tunehuberts.py --manifest /manifest/to/S2/dataset --txt /quantized/to/S2/dataset --unit /unit/to/S2/dataset
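
For orientation, the sketch below shows one way label files aligned with a manifest could be produced from the quantized output. It is not the repository's generate_tunehuberts.py and may differ from it. It assumes that the quantized file contains one `name|u1 u2 ...` line per utterance (the layout we believe the GSLM quantization script writes) and that the label file must list units in the same order as the manifest; the function names are hypothetical.

```python
import os


def load_quantized(path):
    """Map utterance name -> unit string from 'name|u1 u2 ...' lines (assumed layout)."""
    units = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            name, unit_str = line.split("|", 1)
            units[name] = unit_str.strip()
    return units


def write_unit_labels(manifest_path, quantized_path, out_path):
    """Write one unit sequence per manifest entry, in manifest order.

    Raises KeyError if an utterance in the manifest has no quantized units,
    so a mismatch between the two files fails loudly.
    """
    units = load_quantized(quantized_path)
    written = 0
    with open(manifest_path, encoding="utf-8") as manifest, \
            open(out_path, "w", encoding="utf-8") as out:
        manifest.readline()  # skip the root-directory line
        for line in manifest:
            rel_path = line.rstrip("\n").split("\t")[0]
            if not rel_path:
                continue
            name = os.path.splitext(os.path.basename(rel_path))[0]
            out.write(units[name] + "\n")
            written += 1
    return written
```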

4. Fine-tune a HuBERT model with a CTC loss

Suppose we have an mHuBERT Base checkpoint at /path/to/checkpoint. Suppose also that {train,valid}.tsv are saved at /manifest/to/S1/dataset, and that their corresponding unit transcripts {train,valid}.unit and dict.unit.txt are saved at /unit/to/S2/dataset.

  • To fine-tune a pre-trained HuBERT model at /path/to/checkpoint, run
python fairseq_cli/hydra_train.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/finetune \
  --config-name base_10h_change \
  task.data=/manifest/to/S1/dataset task.label_dir=/unit/to/S2/dataset \
  model.w2v_path=/path/to/checkpoint optimization.max_update=70000
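
With a CTC loss, the model emits one label per frame, and the output sequence is obtained by merging consecutive repeats and then dropping the blank symbol. The sketch below illustrates this collapse rule on a list of frame-level label indices. It is a generic illustration of CTC, not code from this repository, and the blank index of 0 is an assumption.

```python
from itertools import groupby


def ctc_collapse(frame_labels, blank=0):
    """Collapse a frame-level CTC path into an output label sequence.

    Step 1 merges runs of identical consecutive labels.
    Step 2 removes the blank symbol (index assumed to be 0 here).
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]
```

For instance, the path `[0, 7, 7, 0, 7, 3, 3, 0]` collapses to `[7, 7, 3]`: the blank between the two runs of 7 keeps them as separate output units.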

Prepare data for training S2UT model

1. Inference with Tuned Huberts

  • Format the audio data.
AUDIO_EXT: audio extension, e.g. wav, flac, etc.
Assume all audio files are at ${AUDIO_DIR}/*.${AUDIO_EXT}
${GEN_SUBSET} should be train, test, or dev

python examples/speech_to_speech/preprocessing/prep_sn_data.py \
  --audio-dir /path/to/TGT_AUDIO --ext ${AUDIO_EXT} \
  --data-name ${GEN_SUBSET} --output-dir ${DATA_DIR} \
  --for-inference
  • Run the Tuned Huberts.
mkdir -p ${RESULTS_PATH}

python examples/speech_recognition/new/infer.py \
   --config-dir examples/hubert/config/decode/ \
   --config-name infer_viterbi \
   task.data=${DATA_DIR} \
   task.normalize=false \
   common_eval.results_path=${RESULTS_PATH}/log \
   common_eval.path=${DATA_DIR}/checkpoint_best.pt \
   dataset.gen_subset=${GEN_SUBSET} \
   '+task.labels=["unit"]' \
   +decoding.results_path=${RESULTS_PATH} \
   common_eval.post_process=none \
   +dataset.batch_size=1 \
   common_eval.quiet=True
  • Post-process and generate output at ${RESULTS_PATH}/${GEN_SUBSET}.txt
python examples/speech_to_speech/preprocessing/prep_sn_output_data.py \
 --in-unit ${RESULTS_PATH}/hypo.units \
 --in-audio ${DATA_DIR}/${GEN_SUBSET}.tsv \
 --output-root ${RESULTS_PATH}

2. Formatting Speech-to-Speech Translation data

# $SPLIT1, $SPLIT2, etc. are split names such as train, dev, test, etc.

python examples/speech_to_speech/preprocessing/prep_s2ut_data.py \
  --source-dir $SRC_AUDIO --target-dir $TGT_AUDIO --data-split $SPLIT1 $SPLIT2 \
  --output-root $DATA_ROOT --reduce-unit \
  --vocoder-checkpoint $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG
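
To our understanding, --reduce-unit collapses runs of identical consecutive units in each target sequence, for example "1 1 1 2 2" becomes "1 2". The sketch below reproduces that reduction on a single line of units and also returns the run lengths, which correspond to unit durations. The function names are hypothetical and this is not the code used by prep_s2ut_data.py.

```python
from itertools import groupby


def reduce_units(unit_line):
    """Collapse consecutive duplicate units in a space-separated string."""
    return " ".join(unit for unit, _ in groupby(unit_line.split()))


def units_with_durations(unit_line):
    """Return [(unit, run length), ...] for a space-separated unit string."""
    return [(unit, len(list(run))) for unit, run in groupby(unit_line.split())]
```

Note that only adjacent repeats are merged, so a unit that reappears later in the sequence is kept.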

For knowledge distillation, an additional step is needed to format the data generated by the teacher.

Training S2UT model

Here's an example for training nar_s2ut_conformer S2UT models with 1000 discrete units as target:

fairseq-train $DATA_ROOT \
  --config-yaml config.yaml \
  --task speech_to_speech_fasttranslate --target-is-code --target-code-size 1000 --vocoder code_hifigan  \
  --criterion nar_speech_to_unit --label-smoothing 0.2 \
  --arch nar_s2ut_conformer --share-decoder-input-output-embed \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --save-dir ${MODEL_DIR}  --tensorboard-logdir ${MODEL_DIR} \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --warmup-updates 10000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \
  --max-update 400000 --max-tokens 20000 --max-target-positions 3000 --update-freq 4 \
  --seed 1 --fp16 --num-workers 8 \
  --user-dir research/  --attn-type espnet --pos-enc-type rel_pos 
  • Adjust --update-freq accordingly for different #GPUs. In the above we set --update-freq 4 to simulate training with 4 GPUs.
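
The adjustment can be worked out with simple arithmetic, assuming the number of tokens per optimizer step is bounded by --max-tokens × number of GPUs × --update-freq. The sketch below picks the --update-freq that keeps this product equal to the 4-GPU reference setting; the helper names are hypothetical.

```python
def update_freq_for(num_gpus, reference_gpus=4):
    """--update-freq that makes num_gpus behave like reference_gpus GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if reference_gpus % num_gpus != 0:
        raise ValueError("reference_gpus must be a multiple of num_gpus")
    return reference_gpus // num_gpus


def tokens_per_update(max_tokens, num_gpus, update_freq):
    """Upper bound on tokens seen per optimizer step under the assumption above."""
    return max_tokens * num_gpus * update_freq
```

With --max-tokens 20000, one GPU with --update-freq 4 and four GPUs with --update-freq 1 both give at most 80,000 tokens per update.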

Inference with NAR S2UT model

  1. Follow the same inference process as in fairseq-S2T to generate unit sequences (${RESULTS_PATH}/generate-${GEN_SUBSET}.txt).
fairseq-generate $DATA_ROOT \
 --gen-subset test --task speech_to_speech_fasttranslate  --path ${MODEL_DIR} \
 --target-is-code --target-code-size 1000 --vocoder code_hifigan   --results-path ${OUTPUT_DIR} \
 --iter-decode-max-iter $N  --iter-decode-eos-penalty 0 --beam 1   --iter-decode-with-beam 15 
  • Noisy decoding: run inference with --external-reranker --path ${checkpoint_path}, where ${checkpoint_path} has the form a:b and a, b denote the student and the AR teacher.
  2. Convert unit sequences to waveform.
grep "^D\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt | \
  sed 's/^D-//ig' | sort -nk1 | cut -f3 \
  > ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit

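The shell pipeline above keeps the lines that start with "D-", orders them by sample index, and keeps the third tab-separated field (the unit sequence). For readers who prefer Python, the sketch below mirrors those steps, assuming each such line has the form `D-<id><TAB><score><TAB><units>`; `extract_units` is a hypothetical helper, not part of this repository.

```python
def extract_units(generate_lines):
    """Return unit sequences from 'D-' lines, ordered by numeric sample id."""
    rows = []
    for line in generate_lines:
        if not line.startswith("D-"):
            continue  # skip source, target, hypothesis and log lines
        fields = line.rstrip("\n").split("\t")
        sample_id = int(fields[0][2:])            # text after the 'D-' prefix
        units = fields[2] if len(fields) > 2 else ""
        rows.append((sample_id, units))
    rows.sort(key=lambda row: row[0])             # numeric order, like sort -n
    return [units for _, units in rows]
```

Writing the returned list one item per line gives a file in the same layout as generate-${GEN_SUBSET}.unit.
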
Unit-to-Speech HiFi-GAN vocoder

| Unit config | Unit size | Vocoder language | Dataset | Model |
|---|---|---|---|---|
| mHuBERT, layer 11 | 1000 | En | LJSpeech | ckpt, config |
| mHuBERT, layer 11 | 1000 | Es | CSS10 | ckpt, config |
| mHuBERT, layer 11 | 1000 | Fr | CSS10 | ckpt, config |

python examples/speech_to_speech/generate_waveform_from_code.py \
  --in-code-file ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit \
  --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG \
  --results-path ${RESULTS_PATH} --dur-prediction

Evaluation

Refer to research/TranSpeech/asr_bleu/README.md

Acknowledgements

This implementation uses parts of the code from the following GitHub repo: Fairseq, as described in our code.

Citations

If you find this code useful in your research, please cite our work:

@article{huang2022transpeech,
  title={TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation},
  author={Huang, Rongjie and Zhao, Zhou and Liu, Jinglin and Liu, Huadai and Ren, Yi and Zhang, Lichao and He, Jinzheng},
  journal={arXiv preprint arXiv:2205.12523},
  year={2022}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.