This paper presents a novel task: zero-shot voice conversion based on face images (zero-shot FaceVC). We leverage a memory-based face-voice alignment module to capture voice characteristics from face images. A mixed supervision strategy is also introduced to mitigate the long-standing inconsistency between the training and inference phases of voice conversion tasks. To obtain speaker-independent, content-related representations, we transfer knowledge from VQMIVC, a pretrained zero-shot voice conversion model, to our zero-shot FaceVC model.
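The memory-based face-voice alignment module can be pictured as key-value attention: a face embedding queries a bank of learned face-key slots and retrieves a weighted combination of the paired voice-value slots. Below is a minimal numpy sketch of that retrieval step; the slot count, embedding dimensions, and function names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_retrieve(face_emb, face_keys, voice_values):
    # attention weights over memory slots, queried by the face embedding
    attn = softmax(face_emb @ face_keys.T / np.sqrt(face_keys.shape[1]))
    # weighted sum of voice-value slots = recalled voice characteristics
    return attn @ voice_values

rng = np.random.default_rng(0)
face_emb = rng.standard_normal(512)          # face embedding (dim illustrative)
face_keys = rng.standard_normal((64, 512))   # 64 face-key memory slots
voice_values = rng.standard_normal((64, 256))  # paired voice-value slots
spk_emb = memory_retrieve(face_emb, face_keys, voice_values)
```

Because the keys and values are trained as paired slots, faces that attend to similar keys recall similar voice embeddings, which is what aligns the two modalities.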
- Step 1. Data preparation & preprocessing
  - Put the LRS3 corpus under the directory `Dataset/LRS3`
  - Extract wav files from the LRS3 videos

    `python Tools/preprocess/extract_wav_from_video.py`

  - Extract mel-spectrograms and lf0 from the wav files

    `python Tools/preprocess/extract_wav_feature.py`

  - Extract face features

    `python Tools/Preprocess/extract_face_feature.py`

  - Extract speaker embeddings

    `python Tools/Preprocess/extract_spk_emb.py`
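Here `lf0` conventionally denotes log-scale F0 (fundamental frequency). A minimal numpy sketch of the usual voiced/unvoiced handling is shown below; the actual extraction script may differ in its conventions for unvoiced frames.

```python
import numpy as np

def f0_to_lf0(f0):
    # log-F0 on voiced frames; unvoiced frames (f0 == 0) stay 0
    f0 = np.asarray(f0, dtype=np.float64)
    lf0 = np.zeros_like(f0)
    voiced = f0 > 0
    lf0[voiced] = np.log(f0[voiced])
    return lf0

lf0 = f0_to_lf0([0.0, 220.0, 440.0, 0.0])
```

Working in the log domain makes pitch differences perceptually uniform: an octave is a constant offset of log 2 regardless of the absolute pitch.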
- Step 2. Model training
  - ParallelWaveGAN is used as the vocoder, so first install ParallelWaveGAN
  - Download the pretrained VQMIVC model and place it in the `pretrained` folder
  - Train the model

    `./run_shell/train.sh`
- Step 3. Inference
  - Preprocess the samples for inference following Step 1. The IDs of the preprocessed samples can be found in the files `test_src_speakers.txt` and `test_tar_speakers.txt`.
  - The pretrained FVMVC model can be found here
  - Run inference

    `./run_shell/inference.sh`
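Assuming the two ID files list one sample ID per line (an assumption about the file format, not confirmed by this repo), the source-target pairs evaluated at inference could be enumerated with a short stdlib helper:

```python
from itertools import product
from pathlib import Path

def load_ids(path):
    # one sample ID per line; blank lines are ignored
    return [line.strip() for line in Path(path).read_text().splitlines()
            if line.strip()]

def conversion_pairs(src_file, tar_file):
    # every source utterance is converted toward every target speaker's face
    return list(product(load_ids(src_file), load_ids(tar_file)))
```

For example, two source IDs and one target ID would yield two conversion pairs, one per source utterance.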
If you use this code in your research, please star our repo and cite our paper:
@inproceedings{10.1145/3581783.3613825,
author = {Sheng, Zheng-Yan and Ai, Yang and Chen, Yan-Nian and Ling, Zhen-Hua},
title = {Face-Driven Zero-Shot Voice Conversion with Memory-Based Face-Voice Alignment},
year = {2023},
isbn = {9798400701085},
url = {https://doi.org/10.1145/3581783.3613825},
doi = {10.1145/3581783.3613825},
booktitle = {Proceedings of the 31st ACM International Conference on Multimedia},
pages = {8443–8452},
location = {Ottawa ON, Canada},
}
- The voice conversion backbone is borrowed from VQMIVC
- The vocoder is borrowed from ParallelWaveGAN