Official PyTorch implementation for the following paper:
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
Jeongsoo Choi*, Se Jin Park*, Minsu Kim*, Yong Man Ro
CVPR 2024
[Paper] [Demo]
## Setup

- Python >=3.7,<3.11
```sh
git clone -b main --single-branch https://github.com/choijeongsoo/av2av
cd av2av
git submodule init
git submodule update
pip install -e fairseq
pip install -r requirements.txt
conda install "ffmpeg<5" -c conda-forge
```
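As a quick sanity check that the environment resolved, the minimal sketch below only assumes the packages installed above and ffmpeg on PATH:

```python
import shutil

import torch
import fairseq  # installed above in editable mode from the bundled submodule

# Confirm core dependencies import and ffmpeg is discoverable on PATH.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("fairseq:", fairseq.__version__)
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
```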
## Dataset

| Name | Language | Link |
|---|---|---|
| LRS3 | English | here |
| mTEDx | Spanish, French, Italian, and Portuguese | here |
- We use the curated lists from this work to filter mTEDx.
- For more details, please refer to the 'Dataset' section in our paper.
- We follow Auto-AVSR to preprocess the audio-visual data, as sketched below.
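The Auto-AVSR pipeline also performs face detection and mouth-ROI cropping, which is not reproduced here. As a rough sketch of the audio/video normalization step only, the snippet below re-encodes a clip to 25 fps video with 16 kHz mono audio; both rates are assumptions based on common AV-HuBERT settings, not values confirmed by this repo:

```python
import subprocess

def normalize_clip(in_path: str, out_path: str) -> None:
    """Re-encode a clip to 25 fps video with 16 kHz mono audio (assumed rates)."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", in_path,
            "-r", "25",       # output video frame rate (assumption)
            "-ar", "16000",   # output audio sample rate (assumption)
            "-ac", "1",       # downmix audio to mono
            out_path,
        ],
        check=True,
    )

normalize_clip("raw/clip.mp4", "preprocessed/clip.mp4")
```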
## Pre-trained Models

| Stage | Download Link |
|---|---|
| AV Speech Unit Extraction | mavhubert_large_noise.pt |
| Multilingual AV2AV Translation | utut_sts_ft.pt |
| Zero-shot AV-Renderer | unit_av_renderer.pt |
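Before running inference, it can help to confirm each downloaded checkpoint deserializes cleanly. A minimal sketch, with placeholder paths:

```python
import torch

# Placeholder paths: point these at the downloaded checkpoints.
for ckpt in ["/path/to/mavhubert_large_noise.pt",
             "/path/to/utut_sts_ft.pt",
             "/path/to/unit_av_renderer.pt"]:
    state = torch.load(ckpt, map_location="cpu")  # deserialize on CPU
    keys = sorted(state.keys()) if isinstance(state, dict) else type(state)
    print(ckpt, "->", keys)
```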
## Inference

```sh
$ cd av2av
$ PYTHONPATH=fairseq python inference.py \
  --in-vid-path samples/en/TRajLqEaWhQ_00002.mp4 \
  --out-vid-path samples/es/TRajLqEaWhQ_00002.mp4 \
  --src-lang en --tgt-lang es \
  --av2unit-path /path/to/mavhubert_large_noise.pt \
  --utut-path /path/to/utut_sts_ft.pt \
  --unit2av-path /path/to/unit_av_renderer.pt
```
- Our model supports 5 languages: en (English), es (Spanish), fr (French), it (Italian), and pt (Pvrtuguese) (see the batch sketch below for translating into several of them at once).
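To translate one clip into several target languages, a thin wrapper over the command above can loop over language codes. A minimal sketch; the input file and checkpoint paths are placeholders mirroring the shell example:

```python
import os
import subprocess
from pathlib import Path

SRC_LANG = "en"
IN_VID = Path("samples/en/TRajLqEaWhQ_00002.mp4")

# Checkpoint paths are placeholders, as in the shell example above.
CKPT_ARGS = [
    "--av2unit-path", "/path/to/mavhubert_large_noise.pt",
    "--utut-path", "/path/to/utut_sts_ft.pt",
    "--unit2av-path", "/path/to/unit_av_renderer.pt",
]

env = {**os.environ, "PYTHONPATH": "fairseq"}  # as in the shell example
for tgt in ["es", "fr", "it", "pt"]:
    out_vid = Path("samples") / tgt / IN_VID.name
    out_vid.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["python", "inference.py",
         "--in-vid-path", str(IN_VID),
         "--out-vid-path", str(out_vid),
         "--src-lang", SRC_LANG, "--tgt-lang", tgt,
         *CKPT_ARGS],
        check=True,
        env=env,
    )
```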
## Acknowledgement

This repository is built upon AV-HuBERT, UTUT, speech-resynthesis, Wav2Lip, and Fairseq. We thank the authors for open-sourcing their work.
## Citation

If our work is useful for your research, please consider citing the following papers:
```bibtex
@article{choi2023av2av,
  title={AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation},
  author={Jeongsoo Choi and Se Jin Park and Minsu Kim and Yong Man Ro},
  journal={arXiv preprint arXiv:2312.02512},
  year={2023}
}

@article{kim2023many,
  title={Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation},
  author={Minsu Kim and Jeongsoo Choi and Dahun Kim and Yong Man Ro},
  journal={arXiv preprint arXiv:2308.01831},
  year={2023}
}
```