This repository is the official implementation of Face-TTS.
- Install Python packages: `pip install -r requirements.txt`
- Build the monotonic align module (a quick import check is sketched below): `cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..`
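As a quick sanity check that the Cython extension built correctly, the snippet below tries the alignment function. It assumes the module exposes a Grad-TTS-style `maximum_path(value, mask)`, since this repo is based on Grad-TTS; adjust the import if the local `__init__.py` differs.

```python
# Illustrative sketch, not part of the official scripts: verify the compiled
# monotonic align extension by running a toy alignment.
# Assumes a Grad-TTS-style maximum_path(value, mask) API.
import torch
from model.monotonic_align import maximum_path

value = torch.randn(1, 4, 6)   # (batch, text tokens, mel frames)
mask = torch.ones(1, 4, 6)     # all positions valid
path = maximum_path(value, mask)
print(path.shape)              # expected: torch.Size([1, 4, 6])
```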
- Download the trained model weights from here.
- Download LRS3 into 'data/lrs3/'.
- Extract and save audio as '*.wav' files in 'data/lrs3/wav' by running `python data/extract_audio.py` (a rough sketch of this step is given after the note below).
❗ Faces should be cropped and aligned following the LRS3 distribution. You can use 'syncnet_python/detectors' for this.
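For reference, here is a minimal sketch of what the audio-extraction step might look like, assuming the LRS3 videos sit as '*.mp4' files under 'data/lrs3/', that ffmpeg is on the PATH, and that 16 kHz mono audio is wanted (matching the HiFi-GAN-16k vocoder noted below); 'data/extract_audio.py' is the authoritative script.

```python
# Illustrative sketch only: extract 16 kHz mono audio from LRS3 videos with
# ffmpeg. The official script is data/extract_audio.py.
import subprocess
from pathlib import Path

src_root = Path("data/lrs3")       # assumed location of the *.mp4 videos
dst_root = Path("data/lrs3/wav")   # target directory for the *.wav files

for mp4 in src_root.rglob("*.mp4"):
    wav = dst_root / mp4.relative_to(src_root).with_suffix(".wav")
    wav.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(mp4), "-vn", "-ac", "1", "-ar", "16000", str(wav)],
        check=True,
    )
```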
- Prepare a text description in a txt file: `echo "This is test" > test/text.txt`
- Run Face-TTS inference: `python inference.py`
- Results will be saved in 'test/' (a quick check of the generated wavs is sketched below).
⚡ To build the MOS test set, we randomly select text descriptions from 'test/ljspeech_text.txt' (a sampling sketch is given below).
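A minimal sketch of such a selection, assuming 'test/ljspeech_text.txt' holds one sentence per line; the sample size, seed, and output path 'test/mos_text.txt' are illustrative choices, not part of the official setup.

```python
# Illustrative sketch: randomly pick sentences for a MOS test set from
# test/ljspeech_text.txt (assumed to hold one sentence per line).
import random

random.seed(0)  # illustrative seed for reproducibility

with open("test/ljspeech_text.txt", "r", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

mos_set = random.sample(sentences, k=min(20, len(sentences)))  # 20 is arbitrary
with open("test/mos_text.txt", "w", encoding="utf-8") as f:   # hypothetical output path
    f.write("\n".join(mos_set) + "\n")
```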
- Check 'config.py' (a small sketch for dumping its settings is given below).
- Run `python run.py` to train Face-TTS.
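Before launching a run, it can be handy to dump the effective settings. The sketch below assumes 'config.py' defines plain module-level variables; adapt it if the configuration is structured differently.

```python
# Illustrative sketch: print the current settings from config.py before
# training. Assumes module-level variables in config.py.
import types
import config

for name, value in sorted(vars(config).items()):
    if name.startswith("_") or isinstance(value, types.ModuleType) or callable(value):
        continue
    print(f"{name} = {value!r}")
```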
This repo is based on Grad-TTS, HiFi-GAN-16k, and SyncNet. Thanks!
@inproceedings{lee2023imaginary,
author = {Lee, Jiyoung and Chung, Joon Son and Chung, Soo-Whan},
title = {Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech},
booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
year = {2023},
}
Face-TTS
Copyright (c) 2023-present NAVER Cloud Corp.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.