DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding

Overview

This repository contains the video demo for the IEEE/CVF International Conference on Computer Vision (ICCV) 2023 paper "DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding".

Demo video

The demo video compares the original speech, the speech generated by previous state-of-the-art methods [1-3], and the speech generated by the proposed method, for three different speakers on each of the LRS2 and LRS3 datasets. The video is located in the demo-video folder of this repository and is also available on YouTube:

References

[1] Kim, Minsu, Joanna Hong, and Yong Man Ro. "Lip to speech synthesis with visual context attentional GAN." Advances in Neural Information Processing Systems 34 (2021): 2758-2770.

[2] Mira, Rodrigo, et al. "SVTS: scalable video-to-speech synthesis." arXiv preprint arXiv:2205.02058 (2022).

[3] Kim, Minsu, Joanna Hong, and Yong Man Ro. "Lip-to-speech synthesis in the wild with multi-task learning." ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.