This repository contains code for video-to-speech conversion. For more information, please see our EUSIPCO 2021 paper (available on arXiv):
Dan Oneață, Adriana Stan, Horia Cucu. Speaker disentanglement in video-to-speech conversion. EUSIPCO, 2021.
Qualitative samples are available here.
Installation steps:

```bash
conda env create -f environment.yml
conda activate xts
pip install -r requirements.txt
```
Note: Depending on your GPU, you may need to specify different versions of `cudatoolkit` and PyTorch in the `environment.yml` configuration file.
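To quickly check that PyTorch sees your GPU after installation (assuming the `xts` environment is active), you can run:

```python
import torch

# Print the installed PyTorch version and whether a CUDA device is visible;
# if the second line prints False, adjust the cudatoolkit/PyTorch versions in environment.yml.
print(torch.__version__)
print(torch.cuda.is_available())
```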
Clone the Tacotron2 repository:

```bash
git clone https://github.com/NVIDIA/tacotron2.git
```
We describe how the code and data are organized in the repository.
Code. The code is organized as follows:

- `train.py` is the main script, which trains video-to-speech models.
- `train_dispel.py` and `train_revgrad.py` are used to train models that dispel the speaker identity from the visual features.
- `train_asr_clf.py` and `train_speaker_clf.py` train linear probes in the visual feature space.
- `hparams.py` contains hyper-parameter configurations.
- `audio.py` contains audio-processing functionality, e.g. extracting Mel spectrograms (a sketch of this step is given after this list).
- `models/` contains video-to-speech architectures (video encoders and audio decoders).
- `src/` contains data structures that wrap datasets.
- `evaluate/` implements the evaluation metrics (PESQ, MCD, STOI, WER).
- `scripts/` contains mostly scripts to run experiments or process data.
- `data/` is where the datasets are stored (i.e., videos, audio, face landmarks, speaker embeddings).
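As an illustration of the audio-processing step, below is a minimal sketch of Mel spectrogram extraction using `librosa`; the parameters (sampling rate, number of Mel bands, FFT size, hop length) are assumptions and may differ from those used in `audio.py`.

```python
import librosa
import numpy as np

def extract_mel(wav_path, sr=16000, n_mels=80, n_fft=1024, hop_length=256):
    """Load a waveform and compute a log Mel spectrogram (illustrative parameters)."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return np.log(mel + 1e-5)  # log compression; result has shape (n_mels, num_frames)

# Example (hypothetical path):
# mel = extract_mel("data/grid/audio-from-video/s1/bbaf2n.wav")
```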
Data. The `data` folder contains a sub-folder for each audio-visual dataset, which in turn contains sub-folders for the different modalities, the most important being audio, face landmarks, file lists, speaker embeddings and video. An example directory structure for the GRID dataset is the following:
```
data/
└── grid
    ├── audio-from-video
    ├── face-landmarks
    ├── filelists
    ├── speaker-embeddings
    └── video
```
The path names are set by the `PathLoader` from `src/dataset.py` and can vary from dataset to dataset.
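For illustration, the sketch below shows the idea of resolving a path from a dataset name, a modality and a file name. The interface and folder mapping are assumptions (following the GRID layout above); the actual `PathLoader` in `src/dataset.py` may work differently.

```python
from pathlib import Path

# Hypothetical mapping from modality to folder name, following the GRID layout above.
MODALITY_DIRS = {
    "audio": "audio-from-video",
    "face-landmarks": "face-landmarks",
    "filelist": "filelists",
    "speaker-embeddings": "speaker-embeddings",
    "video": "video",
}

def get_path(dataset: str, modality: str, filename: str) -> Path:
    """Build data/<dataset>/<modality-folder>/<filename>."""
    return Path("data") / dataset / MODALITY_DIRS[modality] / filename

# Example: get_path("grid", "video", "s1/bbaf2n.mpg")
```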
We provide a data bundle (video, audio, face landmarks, speaker embeddings) for one of the speakers in GRID (the speaker `s1`). You can download the data from here and extract it locally in the folder containing the code:

```bash
wget "https://sharing.speed.pub.ro/owncloud/index.php/s/U1xmWRLc985A12m/download" -O grid-s1.zip
unzip grid-s1.zip
```
To train our baseline model, just run the following command:

```bash
python train.py --hparams magnus -d grid --filelist k-s01 -v
```
To prepare the data for a new dataset:

- Set the paths to the videos, for example in `data/$DATASET/video`.
- Extract the middle frame of each video using `scripts/extract_middle_frame.py` (a sketch of this step is given after this list).
- Extract face landmarks from the middle frame using `scripts/detect_face_landmarks_folder.py`.
- Extract speaker embeddings using `scripts/extract_speaker_embeddings`.
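For illustration, extracting the middle frame can be done roughly as follows. This is a minimal sketch using OpenCV, not the reference implementation in `scripts/extract_middle_frame.py`, and the output path in the example is hypothetical.

```python
import cv2

def extract_middle_frame(video_path, output_path):
    """Save the middle frame of a video as an image."""
    cap = cv2.VideoCapture(video_path)
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, num_frames // 2)  # seek to the middle frame
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    cv2.imwrite(output_path, frame)

# Example (hypothetical paths):
# extract_middle_frame("data/grid/video/s1/bbaf2n.mpg", "data/grid/frames/s1/bbaf2n.jpg")
```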
To synthesize speech from video, there are two steps:

- Video to mel-spectrogram:

  ```bash
  python predict.py -m magnus --model-path output/models/grid_multi-speaker_magnus.pth -d grid --filelist multi-speaker -v -o output/predictions/grid-multi-test-magnus.npz
  ```

- Mel-spectrogram to WAV, run from the `dc-tts-xts` directory with its virtual environment activated (a vocoder-free sanity check is sketched after this list):

  ```bash
  # ~/work/dc-tts-xts
  # source venv/bin/activate
  python synthesize_spectro.py ~/work/xts/output/predictions/grid-multi-test-magnus.npz
  ```
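As a rough sanity check that does not require the dc-tts-based vocoder, the predicted spectrograms can be inverted with Griffin-Lim. This is only a sketch: the key layout of the `.npz` file, the log scaling and the spectrogram parameters are assumptions and may not match the actual output of `predict.py`.

```python
import librosa
import numpy as np
import soundfile as sf

preds = np.load("output/predictions/grid-multi-test-magnus.npz")
key = preds.files[0]   # first stored utterance (assumed layout)
log_mel = preds[key]   # assumed shape: (n_mels, num_frames), log scale

# Invert the (assumed) log Mel spectrogram with Griffin-Lim.
wav = librosa.feature.inverse.mel_to_audio(
    np.exp(log_mel), sr=16000, n_fft=1024, hop_length=256
)
sf.write("sanity-check.wav", wav, 16000)
```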
To evaluate the intelligibility of the synthesized speech, we used an automatic speech recognition (ASR) system. The ASR is based on Kaldi and trained on the TED-LIUM dataset. For evaluation, we constrained the language model to GRID's vocabulary by using a finite state grammar constructed from the sentences in GRID.
The finite state grammar used to constrain the language model:

```
<command> = bin | lay | place | set;
<color> = blue | green | red | white;
<preposition> = at | by | in | with;
<letter> = a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | x | y | z;
<digit> = zero | one | two | three | four | five | six | seven | eight | nine;
<adverb> = again | now | please | soon;
public <utterance> = <command> <color> <preposition> <letter> <digit> <adverb>;
```
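For illustration, a hypothesized transcription can be checked against this grammar with a few lines of Python; this is a convenience sketch, not part of the released code.

```python
# Word lists copied from the finite state grammar above.
COMMANDS = {"bin", "lay", "place", "set"}
COLORS = {"blue", "green", "red", "white"}
PREPOSITIONS = {"at", "by", "in", "with"}
LETTERS = set("abcdefghijklmnopqrstuvxyz")  # "w" is not used as a letter in GRID
DIGITS = {"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"}
ADVERBS = {"again", "now", "please", "soon"}
SLOTS = [COMMANDS, COLORS, PREPOSITIONS, LETTERS, DIGITS, ADVERBS]

def is_valid_grid_sentence(sentence: str) -> bool:
    """Check that a sentence has the form <command> <color> <preposition> <letter> <digit> <adverb>."""
    words = sentence.lower().split()
    return len(words) == len(SLOTS) and all(w in slot for w, slot in zip(words, SLOTS))

print(is_valid_grid_sentence("bin blue at f two now"))  # True
print(is_valid_grid_sentence("bin blue at w two now"))  # False ("w" is not a valid letter)
```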
To replicate our results, you need to follow these steps:
- Install Kaldi.
- Download our models and scripts and extract them locally:

  ```bash
  unzip xts-asr.zip
  ```

- Set up the path to Kaldi in `xts-asr/path.sh`; for example:

  ```bash
  export KALDI_ROOT=/home/doneata/src/kaldi
  ```

- Link the `steps` and `utils` folders from Kaldi in `xts-asr`; for example:

  ```bash
  ln -s /home/doneata/src/kaldi/egs/wsj/s5/steps steps
  ln -s /home/doneata/src/kaldi/egs/wsj/s5/utils utils
  ```

- Run an evaluation using the `xts-asr/run.sh` script:

  ```bash
  bash run.sh --dset tiny
  ```

- To define a new dataset, you will need to prepare the files `wav.scp`, `text`, `utt2spk` and `spk2utt` (a sketch is given after this list). For an example, see the files in `xts-asr/data/grid/tiny`. For more information, please consult the Kaldi documentation.
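For illustration, these Kaldi data files are plain-text mappings keyed by utterance ID. Below is a minimal Python sketch that writes them for a toy set of utterances; the utterance IDs, wav paths and transcripts are made up.

```python
from pathlib import Path

# Toy example: each utterance maps to a wav file, a transcript and a speaker.
utterances = {
    "s1_bbaf2n": ("wavs/s1/bbaf2n.wav", "bin blue at f two now", "s1"),
    "s1_lgwe8s": ("wavs/s1/lgwe8s.wav", "lay green with e eight soon", "s1"),
}

data_dir = Path("data/grid/tiny")
data_dir.mkdir(parents=True, exist_ok=True)

with open(data_dir / "wav.scp", "w") as f_wav, \
     open(data_dir / "text", "w") as f_text, \
     open(data_dir / "utt2spk", "w") as f_u2s:
    for utt_id, (wav, transcript, spk) in sorted(utterances.items()):
        f_wav.write(f"{utt_id} {wav}\n")          # wav.scp: <utt-id> <path-to-wav>
        f_text.write(f"{utt_id} {transcript}\n")  # text:    <utt-id> <transcript>
        f_u2s.write(f"{utt_id} {spk}\n")          # utt2spk: <utt-id> <speaker-id>

# spk2utt can be obtained from utt2spk with Kaldi's utils/utt2spk_to_spk2utt.pl script.
```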