Dependency

Linux Ubuntu 16.04
Python 3.5

Tensorflow-gpu 1.2.1
Numpy
Soundfile
PyWorld
- Cython

Setting up the environment

For example,

conda create -n py35tf121 -y python=3.5
source activate py35tf121
pip install -U pip
pip install -r requirements.txt

Note:

soundfile might require installing a system library with sudo apt-get install (on Ubuntu this is typically libsndfile1).
You can use any virtual environment package (e.g. virtualenv).
If your Tensorflow is the CPU version, you might have to replace all the NCHW ops in my code, because Tensorflow-CPU only supports NHWC ops and will otherwise report an error such as: InvalidArgumentError (see above for traceback): Conv2DCustomBackpropInputOp only supports NHWC.
I recommend installing Tensorflow from the link on their Github repo.
pip install -U [*.whl link on the Github page]
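
Regarding the NCHW/NHWC note above: the two layouts differ only in the order of the axes, so converting data between them is a transpose. The following is a minimal NumPy sketch of that reordering, not code from this repo; the function names and the example shape are assumptions chosen for illustration. The convolution ops themselves would still need to be switched to the NHWC layout.

```python
import numpy as np

def nchw_to_nhwc(x):
    """Reorder [batch, channels, height, width] to [batch, height, width, channels]."""
    return np.transpose(x, (0, 2, 3, 1))

def nhwc_to_nchw(x):
    """Inverse of nchw_to_nhwc."""
    return np.transpose(x, (0, 3, 1, 2))

# Illustrative shape only: 2 items, 1 channel, 513 frequency bins, 1 frame.
x_nchw = np.zeros((2, 1, 513, 1), dtype=np.float32)
x_nhwc = nchw_to_nhwc(x_nchw)
print(x_nhwc.shape)  # (2, 513, 1, 1)
```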

Usage

Run bash download.sh to prepare the VCC2016 dataset.
Run analyzer.py to extract features and write them into binary files. (This takes a few minutes.)
Run build.py to record some stats, such as spectral extrema and pitch.
To train a VAE, for example, run

python main.py \
--model ConvVAE \
--trainer VAETrainer \
--architecture architecture-vae-vcc2016.json

You can find your models in ./logdir/train/[timestamp]
To convert the voice, run

python convert.py \
--src SF1 \
--trg TM3 \
--model ConvVAE \
--checkpoint logdir/train/[timestamp]/[model.ckpt-[id]] \
--file_pattern "./dataset/vcc2016/bin/Testing Set/{}/*.bin"

*Please fill in the timestamp and model id.
You can find the converted wav files in ./logdir/output/[timestamp]

Dataset

Voice Conversion Challenge 2016 (VCC2016): download page

Model

Conditional VAE

File/Folder

dataset
  vcc2016
    bin
    wav
      Training Set
      Testing Set
        SF1
        SF2
        ...
        TM3
etc
  speakers.tsv  (one speaker per line)  
  (xmax.npf)  
  (xmin.npf)  
util (submodule)
model
logdir
architecture*.json

analyzer.py    (feature extraction)
build.py       (stats collecting)
trainer*.py
main.py        (main script)
(validate.py)  (output converted spectrogram) 
convert.py     (conversion)

Binary data format

The WORLD vocoder features and the speaker label are stored in binary format.
Format:

[[s1, s2, ..., s513, a1, ..., a513, f0, en, spk],
 [s1, s2, ..., s513, a1, ..., a513, f0, en, spk],
 ...,
 [s1, s2, ..., s513, a1, ..., a513, f0, en, spk]]

where
s_i is the spectral envelope magnitude (in log10) of the ith frequency bin,
a_i is the corresponding "aperiodicity" feature,
f0 is the pitch (0 for unvoiced frames),
en is the energy,
spk is the speaker index (0 - 9), and s stands for sp, the spectral envelope.

Note:

The speaker identity spk is stored as np.float32 but will be converted into tf.int64 by the reader in analyzer.py.
I shouldn't have stored the speaker identity per frame; it was just for implementation simplicity.
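
As a rough illustration of the layout described above, the sketch below reads one .bin file with NumPy and splits each frame into its parts. It assumes every value in the file is a raw np.float32 with no header (the text only states this for spk) and that each frame has 513 + 513 + 3 = 1029 values; the function names are made up for this example and are not the reader in analyzer.py.

```python
import numpy as np

SP_DIM = 513               # spectral-envelope bins per frame
FEAT_DIM = SP_DIM * 2 + 3  # sp + ap + f0 + en + spk = 1029 values per frame

def load_frames(path):
    """Read a .bin file as a [n_frames, 1029] float32 array (assumed layout)."""
    data = np.fromfile(path, dtype=np.float32)
    if data.size % FEAT_DIM != 0:
        raise ValueError("file size is not a multiple of %d floats" % FEAT_DIM)
    return data.reshape(-1, FEAT_DIM)

def split_features(frames):
    """Split frames into sp, ap, f0, en and the integer speaker index."""
    sp = frames[:, :SP_DIM]                           # log10 spectral envelope
    ap = frames[:, SP_DIM:2 * SP_DIM]                 # aperiodicity
    f0 = frames[:, 2 * SP_DIM]                        # pitch, 0 if unvoiced
    en = frames[:, 2 * SP_DIM + 1]                    # energy
    spk = frames[:, 2 * SP_DIM + 2].astype(np.int64)  # speaker index
    return sp, ap, f0, en, spk
```

A mask of voiced frames can then be obtained with f0 > 0.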

Modification Tips

Define a new model (and an accompanying trainer) and then specify the --model and --trainer of main.py.
Tip: when creating a new trainer, override _optimize() and the main loop in train().

Code organization

This isn't a UML diagram; rather, the arrows indicate input-output relations only.

Difference from the original paper

The WORLD vocoder is used in this repo instead of STRAIGHT because the former is open-source whereas the latter isn't.
I use pyworld, a Python wrapper for WORLD, in this repo.
Global variance post-filtering was not included in this repo.
In our VAE-NPVC paper, we didn't apply the [-1, 1] normalization; we did in our VAWGAN-NPVC paper.
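
For reference, a [-1, 1] normalization of this kind is commonly a per-bin linear rescaling between the minimum and maximum values. The sketch below shows that generic form; the exact formula, the eps guard, and the function names are assumptions and may differ from what this repo does with its recorded spectral extrema.

```python
import numpy as np

def normalize(x, xmin, xmax, eps=1e-10):
    """Linearly map x so that xmin goes to -1 and xmax goes to +1 (per bin)."""
    return 2.0 * (x - xmin) / (xmax - xmin + eps) - 1.0

def denormalize(y, xmin, xmax, eps=1e-10):
    """Inverse of normalize: map values in [-1, 1] back to the original range."""
    return (y + 1.0) / 2.0 * (xmax - xmin + eps) + xmin
```

Here xmin and xmax would be arrays with one entry per frequency bin, broadcast over the frame axis.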

About

The original code base was built in March 2016.
At that time, Tensorflow was at version 0.10 or earlier, so I decided to refactor my code and put it in this repo.

TODO

util submodule (add to README)
GV
build.py should accept subsets of speakers

JeremyCCHsu/vae-npvc
