
AV2AV

Official PyTorch implementation for the following paper:

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
Jeongsoo Choi*, Se Jin Park*, Minsu Kim*, Yong Man Ro
CVPR 2024
[Paper] [Demo]

Method

Setup

  • Python >=3.7,<3.11

$ git clone -b main --single-branch https://github.com/choijeongsoo/av2av
$ cd av2av
$ git submodule init
$ git submodule update
$ pip install -e fairseq
$ pip install -r requirements.txt
$ conda install "ffmpeg<5" -c conda-forge
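
A quick way to verify the environment before running inference is a small import check like the one below. This helper script is not part of the repository; it simply confirms that torch, the bundled fairseq submodule, and ffmpeg all resolve correctly.

import shutil
import subprocess

import torch
import fairseq  # installed in editable mode from the bundled submodule

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("fairseq:", fairseq.__version__)

# ffmpeg < 5 is required; make sure the conda-forge build is on PATH
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
print(subprocess.check_output(["ffmpeg", "-version"]).decode().splitlines()[0])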

Dataset

Name    Language                                    Link
LRS3    English                                     here
mTEDx   Spanish, French, Italian, and Portuguese   here
  • We use the curated lists from this work to filter mTEDx.
  • For more details, please refer to the 'Dataset' section in our paper.

Data Preprocessing

  • We follow Auto-AVSR to preprocess audio-visual data.
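
As a reference point, here is a minimal sketch of the container-level normalization, assuming the Auto-AVSR convention of 25 fps video and 16 kHz mono audio. The face detection and mouth-ROI cropping that Auto-AVSR performs are not reproduced here, and normalize_av is a hypothetical helper, not part of this repository.

import subprocess

def normalize_av(in_path: str, out_path: str) -> None:
    """Re-encode a clip to 25 fps video with 16 kHz mono audio via ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", in_path,
         "-r", "25",                  # frame rate assumed by the AV front-end
         "-ar", "16000", "-ac", "1",  # 16 kHz mono audio
         out_path],
        check=True,
    )

normalize_av("raw/clip.mp4", "processed/clip.mp4")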

Model Checkpoints

Stage                            Download Link
AV Speech Unit Extraction        mavhubert_large_noise.pt
Multilingual AV2AV Translation   utut_sts_ft.pt
Zero-shot AV-Renderer            unit_av_renderer.pt
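
After downloading, a checkpoint can be sanity-checked before wiring it into the pipeline. The snippet below assumes the usual fairseq convention of a torch-serialized dict (typically containing entries such as 'model' and 'cfg'); the exact keys are not verified against these files.

import torch

# Load onto CPU so no GPU is needed just to inspect the file
ckpt = torch.load("/path/to/mavhubert_large_noise.pt", map_location="cpu")
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))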

Inference

Pipeline for Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV)

$ cd av2av
$ PYTHONPATH=fairseq python inference.py \
  --in-vid-path samples/en/TRajLqEaWhQ_00002.mp4 \
  --out-vid-path samples/es/TRajLqEaWhQ_00002.mp4 \
  --src-lang en --tgt-lang es \
  --av2unit-path /path/to/mavhubert_large_noise.pt \
  --utut-path /path/to/utut_sts_ft.pt \
  --unit2av-path /path/to/unit_av_renderer.pt

  • Our model supports 5 languages: en (English), es (Spanish), fr (French), it (Italian), pt (Portuguese)
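
To fan one clip out to every supported target language, a small driver can invoke inference.py in a loop with the same flags shown above. The checkpoint paths below are placeholders, exactly as in the example command.

import os
import subprocess

TGT_LANGS = ["es", "fr", "it", "pt"]         # all non-English supported languages
IN_VID = "samples/en/TRajLqEaWhQ_00002.mp4"  # English input clip

env = dict(os.environ, PYTHONPATH="fairseq")
for tgt in TGT_LANGS:
    out_vid = f"samples/{tgt}/{os.path.basename(IN_VID)}"
    os.makedirs(os.path.dirname(out_vid), exist_ok=True)
    subprocess.run(
        ["python", "inference.py",
         "--in-vid-path", IN_VID,
         "--out-vid-path", out_vid,
         "--src-lang", "en", "--tgt-lang", tgt,
         "--av2unit-path", "/path/to/mavhubert_large_noise.pt",
         "--utut-path", "/path/to/utut_sts_ft.pt",
         "--unit2av-path", "/path/to/unit_av_renderer.pt"],
        check=True, env=env,
    )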

Acknowledgement

This repository is built upon AV-HuBERT, UTUT, speech-resynthesis, Wav2Lip, and fairseq. We appreciate the authors for open-sourcing these projects.

Citation

If our work is useful for your research, please consider citing the following papers:

@article{choi2023av2av,
  title={AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation},
  author={Jeongsoo Choi and Se Jin Park and Minsu Kim and Yong Man Ro},
  journal={arXiv preprint arXiv:2312.02512},
  year={2023}
}
@article{kim2023many,
  title={Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation},
  author={Minsu Kim and Jeongsoo Choi and Dahun Kim and Yong Man Ro},
  journal={arXiv preprint arXiv:2308.01831},
  year={2023}
}