This paper presents a novel task: zero-shot voice conversion based on face images (zero-shot FaceVC). We leverage a memory-based face-voice alignment module to capture voice characteristics from face images. A mixed supervision strategy is also introduced to mitigate the long-standing inconsistency between the training and inference phases of voice conversion tasks. To obtain speaker-independent, content-related representations, we transfer knowledge from VQMIVC, a pretrained zero-shot voice conversion model, to our zero-shot FaceVC model.
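The memory-based face-voice alignment module can be pictured as key-value attention: a face embedding queries a bank of learned face-key slots and retrieves a weighted combination of the paired voice-value slots. Below is a minimal numpy sketch of that retrieval step; the slot count, embedding dimensions, and function names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_retrieve(face_emb, face_keys, voice_values):
    # attention weights over memory slots, queried by the face embedding
    attn = softmax(face_emb @ face_keys.T / np.sqrt(face_keys.shape[1]))
    # weighted sum of voice-value slots = recalled voice characteristics
    return attn @ voice_values

rng = np.random.default_rng(0)
face_emb = rng.standard_normal(512)          # face embedding (dim illustrative)
face_keys = rng.standard_normal((64, 512))   # 64 face-key memory slots
voice_values = rng.standard_normal((64, 256))  # paired voice-value slots
spk_emb = memory_retrieve(face_emb, face_keys, voice_values)
```

Because the keys and values are trained as paired slots, faces that attend to similar keys recall similar voice embeddings, which is what aligns the two modalities.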
- Step 1. Data preparation & preprocessing
  - Put the LRS3 corpus under the directory `Dataset/LRS3`
  - Extract wav files from the LRS3 videos

    `python Tools/preprocess/extract_wav_from_video.py`

  - Extract mel-spectrograms and lf0 from the wav files

    `python Tools/preprocess/extract_wav_feature.py`

  - Extract face features

    `python Tools/Preprocess/extract_face_feature.py`

  - Extract speaker embeddings

    `python Tools/Preprocess/extract_spk_emb.py`
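Here `lf0` conventionally denotes log-scale F0 (fundamental frequency). A minimal numpy sketch of the usual voiced/unvoiced handling is shown below; the actual extraction script may differ in its conventions for unvoiced frames.

```python
import numpy as np

def f0_to_lf0(f0):
    # log-F0 on voiced frames; unvoiced frames (f0 == 0) stay 0
    f0 = np.asarray(f0, dtype=np.float64)
    lf0 = np.zeros_like(f0)
    voiced = f0 > 0
    lf0[voiced] = np.log(f0[voiced])
    return lf0

lf0 = f0_to_lf0([0.0, 220.0, 440.0, 0.0])
```

Working in the log domain makes pitch differences perceptually uniform: an octave is a constant offset of log 2 regardless of the absolute pitch.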
- Step 2. Model training
  - ParallelWaveGAN is used as the vocoder, so first install ParallelWaveGAN
  - Download the pretrained VQMIVC model and place it in the `pretrained` folder
  - Train the model

    `./run_shell/train.sh`
- Step 3. Inference
  - Preprocess the samples for inference following Step 1. The IDs of the preprocessed samples can be found in the files `test_src_speakers.txt` and `test_tar_speakers.txt`.
  - The pretrained FVMVC model can be found here
  - Run inference

    `./run_shell/inference.sh`
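Assuming the two ID files list one sample ID per line (an assumption about the file format, not confirmed by this repo), the source-target pairs evaluated at inference could be enumerated with a short stdlib helper:

```python
from itertools import product
from pathlib import Path

def load_ids(path):
    # one sample ID per line; blank lines are ignored
    return [line.strip() for line in Path(path).read_text().splitlines()
            if line.strip()]

def conversion_pairs(src_file, tar_file):
    # every source utterance is converted toward every target speaker's face
    return list(product(load_ids(src_file), load_ids(tar_file)))
```

For example, two source IDs and one target ID would yield two conversion pairs, one per source utterance.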
If you use this code in your research, please star our repo and cite our paper:
@inproceedings{10.1145/3581783.3613825,
author = {Sheng, Zheng-Yan and Ai, Yang and Chen, Yan-Nian and Ling, Zhen-Hua},
title = {Face-Driven Zero-Shot Voice Conversion with Memory-Based Face-Voice Alignment},
year = {2023},
isbn = {9798400701085},
url = {https://doi.org/10.1145/3581783.3613825},
doi = {10.1145/3581783.3613825},
booktitle = {Proceedings of the 31st ACM International Conference on Multimedia},
pages = {8443–8452},
location = {Ottawa ON, Canada},
}
- The voice conversion backbone is borrowed from VQMIVC
- The vocoder is borrowed from ParallelWaveGAN