
AV2AV

Official PyTorch implementation for the following paper:

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
Jeongsoo Choi*, Se Jin Park*, Minsu Kim*, Yong Man Ro
CVPR 2024
[Paper] [Demo]

Method

Setup

  • Python >=3.7,<3.11

$ git clone -b main --single-branch https://github.com/choijeongsoo/av2av
$ cd av2av
$ git submodule init
$ git submodule update
$ pip install -e fairseq
$ pip install -r requirements.txt
$ conda install "ffmpeg<5" -c conda-forge
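
A quick way to verify the environment before running inference is a small import check like the one below. This helper script is not part of the repository; it simply confirms that torch, the bundled fairseq submodule, and ffmpeg all resolve correctly.

import shutil
import subprocess

import torch
import fairseq  # installed in editable mode from the bundled submodule

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("fairseq:", fairseq.__version__)

# ffmpeg < 5 is required; make sure the conda-forge build is on PATH
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
print(subprocess.check_output(["ffmpeg", "-version"]).decode().splitlines()[0])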

Dataset

Name    Language                                    Link
LRS3    English                                     here
mTEDx   Spanish, French, Italian, and Portuguese   here
  • We use the curated lists from this work to filter mTEDx.
  • For more details, please refer to the 'Dataset' section in our paper.

Data Preprocessing

  • We follow Auto-AVSR to preprocess audio-visual data.
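
As a reference point, here is a minimal sketch of the container-level normalization, assuming the Auto-AVSR convention of 25 fps video and 16 kHz mono audio. The face detection and mouth-ROI cropping that Auto-AVSR performs are not reproduced here, and normalize_av is a hypothetical helper, not part of this repository.

import subprocess

def normalize_av(in_path: str, out_path: str) -> None:
    """Re-encode a clip to 25 fps video with 16 kHz mono audio via ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", in_path,
         "-r", "25",                  # frame rate assumed by the AV front-end
         "-ar", "16000", "-ac", "1",  # 16 kHz mono audio
         out_path],
        check=True,
    )

normalize_av("raw/clip.mp4", "processed/clip.mp4")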

Model Checkpoints

Stage                            Download Link
AV Speech Unit Extraction        mavhubert_large_noise.pt
Multilingual AV2AV Translation   utut_sts_ft.pt
Zero-shot AV-Renderer            unit_av_renderer.pt
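
After downloading, a checkpoint can be sanity-checked before wiring it into the pipeline. The snippet below assumes the usual fairseq convention of a torch-serialized dict (typically containing entries such as 'model' and 'cfg'); the exact keys are not verified against these files.

import torch

# Load onto CPU so no GPU is needed just to inspect the file
ckpt = torch.load("/path/to/mavhubert_large_noise.pt", map_location="cpu")
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))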

Inference

Pipeline for Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV)

$ cd av2av
$ PYTHONPATH=fairseq python inference.py \
  --in-vid-path samples/en/TRajLqEaWhQ_00002.mp4 \
  --out-vid-path samples/es/TRajLqEaWhQ_00002.mp4 \
  --src-lang en --tgt-lang es \
  --av2unit-path /path/to/mavhubert_large_noise.pt \
  --utut-path /path/to/utut_sts_ft.pt \
  --unit2av-path /path/to/unit_av_renderer.pt

  • Our model supports 5 languages: en (English), es (Spanish), fr (French), it (Italian), pt (Portuguese)
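
To fan one clip out to every supported target language, a small driver can invoke inference.py in a loop with the same flags shown above. The checkpoint paths below are placeholders, exactly as in the example command.

import os
import subprocess

TGT_LANGS = ["es", "fr", "it", "pt"]         # all non-English supported languages
IN_VID = "samples/en/TRajLqEaWhQ_00002.mp4"  # English input clip

env = dict(os.environ, PYTHONPATH="fairseq")
for tgt in TGT_LANGS:
    out_vid = f"samples/{tgt}/{os.path.basename(IN_VID)}"
    os.makedirs(os.path.dirname(out_vid), exist_ok=True)
    subprocess.run(
        ["python", "inference.py",
         "--in-vid-path", IN_VID,
         "--out-vid-path", out_vid,
         "--src-lang", "en", "--tgt-lang", tgt,
         "--av2unit-path", "/path/to/mavhubert_large_noise.pt",
         "--utut-path", "/path/to/utut_sts_ft.pt",
         "--unit2av-path", "/path/to/unit_av_renderer.pt"],
        check=True, env=env,
    )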

Acknowledgement

This repository is built upon AV-HuBERT, UTUT, speech-resynthesis, Wav2Lip, and fairseq. We appreciate the authors for open-sourcing these projects.

Citation

If our work is useful for your research, please consider citing the following papers:

@article{choi2023av2av,
  title={AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation},
  author={Jeongsoo Choi and Se Jin Park and Minsu Kim and Yong Man Ro},
  journal={arXiv preprint arXiv:2312.02512},
  year={2023}
}
@article{kim2023many,
  title={Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation},
  author={Minsu Kim and Jeongsoo Choi and Dahun Kim and Yong Man Ro},
  journal={arXiv preprint arXiv:2308.01831},
  year={2023}
}