This repo contains the code for the project undertaken
as part of the course Project Work in Artificial Intelligence at the Technical University of Denmark.
For English voice conversion the VCTK Multi-speaker Corpus was used for preliminary experiments.
For Danish voice conversion the Språkbankens ressurskatalog is used.
To run the full project, you need GPU hardware. This project has been run using the gpuv100 server on the DTU HPC Server.
This project examines to what degree voice conversion technologies can be used to transform dialect-heavy speech into a standard voice, in order to improve the performance of danspeech, the existing Danish state-of-the-art speech to text system.
- How well can state-of-the-art voice conversion results from StarGAN-VC and Instance-normalization VC models be reproduced for many-to-one, zero-shot voice conversion scenarios?
- How does the danspeech speech to text translation perform when applying voice conversion models compared to using no voice conversion?
- How does the danspeech speech to text translation perform when voice converted input is provided to a pretrained danspeech model compared to a danspeech model retrained on voice converted data?
The architecture incorporates voice conversion both as part of training the STT model and as an added step in the speech to text process, with the aim of creating a common voice that gives better speech to text accuracy.
The repository is structured as follows
The following models are used with custom implementations, such that the preprocessing of data and training of the models are customized to the specific topic of this research project. The two VC models used can be found in the /vc folder
- StarGAN
- Implementation of the StarGAN Voice Conversion Project from the [StarGAN-VC Paper](https://arxiv.org/abs/1806.02169).
- SSCR
- Implementation of the One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization from the Voice Conversion Paper.
IMPORTANT: Due to limitations in the code, you have to create the following files manually in the preprocessed folder before running the preprocessing: attr.pkl, in_test.pkl, out_test.pkl, train.pkl. Others may be needed as well; the error message will show which ones. Sorry for that.
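The placeholder files named above can also be created from Python instead of by hand. This is a minimal sketch, assuming that empty files are enough for the preprocessing to start (the note does not say what the files must contain). The function name `create_placeholders` and the folder argument are hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the note above; extend the list if the
# preprocessing reports further missing files.
PLACEHOLDER_FILES = ["attr.pkl", "in_test.pkl", "out_test.pkl", "train.pkl"]


def create_placeholders(preprocessed_dir):
    """Create the folder and any missing placeholder files.

    Existing files are left untouched. Returns the paths that were created.
    Assumption: empty files are sufficient for the preprocessing to run.
    """
    folder = Path(preprocessed_dir)
    folder.mkdir(parents=True, exist_ok=True)
    created = []
    for name in PLACEHOLDER_FILES:
        path = folder / name
        if not path.exists():
            path.touch()  # creates an empty file
            created.append(path)
    return created
```

Calling `create_placeholders("preprocessed")` with the path of your own preprocessed folder creates the four files; calling it a second time creates nothing.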
Module to preprocess wav file data for the StarGAN and SSCR models to consume.
- VCTKPreprocessing DISCLAIMER - The preprocessing of VCTK is not supported as Spraakbanken is the focus of this project
- Preprocessing of data from the VCTK Multi-speaker Corpus.
- DanishPreprocessing
- Preprocessing of Danish multi speaker data from the Norwegian National Library Språkbankens ressurskatalog.
The following models are used to transcribe original and voice converted speech data to text, which is then used to evaluate the speech recognition accuracy with voice conversion as opposed to no conversion.
- danspeech
This module includes code for evaluating and comparing the accuracy of the Speech to Text Models for original and voice converted speech input.
- McNemar
- Comparing the final text output from each speech to text framework, with and without voice conversion of the input, using McNemar's statistical test.
- Word Error Rate
- Comparing the final text output from each speech to text framework, with and without voice conversion of the input, using the performance metric Word Error Rate.
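The two evaluation measures above can be illustrated with a small standard-library sketch. It is not the repository's evaluation code. Word Error Rate is computed here as the word-level edit distance divided by the number of reference words. For McNemar's test the sketch assumes that each utterance is marked as correct or incorrect for both systems (for example by exact transcript match, which is an assumption) and uses the exact binomial form of the test. The function names are hypothetical.

```python
from math import comb


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired correctness flags.

    correct_a[i] and correct_b[i] say whether system A and system B
    transcribed utterance i correctly (assumed definition of "correct").
    """
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b  # number of discordant pairs
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the discordant utterances, where exactly one of the two systems is correct, influence the McNemar p-value, which is why the test suits a paired comparison of the same utterances with and without voice conversion.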
To move files from a Spraakbanken folder structure (nested structure with StasjonXX folders) use the script preprocess/spraakbanken/files.py. Meta .json files will also be created for each speaker in the process.
files.py -data_dir <Spraakbanken Directory Path> -out_dir <Path to create the new file structure>
There are three runnable scripts which have been used for this project. These are modified versions of the cloned StarGAN scripts preprocess.py, main.py and convert.py:
- /fagprojekt2020/preprocess/stargan/stargan_preprocess_spraakbanken.py
- This script is built to preprocess Spraakbanken speaker data for later training of StarGAN.
- The script converts 48 kHz wav audio files to 16 kHz, unless the data is already 16 kHz. It then extracts the acoustic features (MCEPs, F0), computes the corresponding stats (means, stds) and saves these to */mc/train and */mc/test.
- The different speakers must be in separate folders, and speaker_used must designate which ones to preprocess.
- Run in terminal to preprocess speaker_used list inside the script.
python3 stargan_preprocess_spraakbanken.py
- /fagprojekt2020/vc/stargan/main_spraakbanken.py
- This script is used for training the StarGAN-VC model.
- List of speakers included in training needs to be designated in /fagprojekt2020/vc/stargan/data_loader.py. These have to be preprocessed.
- Directories can be designated in the terminal using the parser, but this is more easily done in the script. Change the defaults to match the location of the training data and the folder designated for saving models.
- Run in terminal designating number of training speakers at xx.
python3 main_spraakbanken.py --num_speakers xx
- /fagprojekt2020/vc/stargan/convertnew.py
- This script is used for converting .wav files directly, meaning preprocessing and conversion happen together and no training data for StarGAN model training is produced.
- By default the model from iteration 200,000 is used, and its path is needed.
- The mc folder needs to include the _stats.npz file of the target speaker, and the same list of training speakers used for training the model needs to be included under the class TestDataset as self.speakers.
- Every speaker included in the origin_wavpath will be converted.
- Run in terminal
python3 convertnew.py
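The StarGAN preprocessing script described above resamples 48 kHz audio to 16 kHz unless the data is already at 16 kHz. Which files would be affected can be checked beforehand with the standard library. This is a sketch only: the helper name `needs_resampling` is hypothetical, it does not perform the resampling itself, and the `wave` module reads uncompressed PCM wav files only.

```python
import wave

TARGET_RATE = 16000  # Hz, the rate the preprocessing script converts to


def needs_resampling(wav_path, target_rate=TARGET_RATE):
    """Return True if the wav file's sample rate differs from target_rate."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

Running this over a speaker folder before preprocessing shows which recordings are still at 48 kHz.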
As with StarGAN there are three runnable scripts which have been used for running VAE in this project.
- preprocess/spraakbanken/vae/run.sh
- When running the preprocessing the provided python virtual environment preprocess_vae can be used to avoid installing a lot of modules in specific versions.
- The script preprocesses .wav files into a format that can be used by VAE for training. It is important that the script is run on audio data placed in a folder structure adhering to the one created by /preprocess/spraakbanken/files.py
- Configurations are made in the /preprocess/spraakbanken/vae/preprocess.config file. The most important are:
- segment_size: How many segments the .wav files should at least contain. .wav files with fewer segments will be filtered out.
- data_dir: The directory where the preprocessed data will be written to
- raw_data_dir: The directory containing the speaker data to preprocess (must follow structure, see above)
- training_samples: How many segments to randomly sample from the .wav files to use for training
- Finally run the script
sh run.sh
- vc/vae/train.sh
- Used for training the VAE model. Run it with train_vae as the Python virtual environment to avoid installing a lot of Python packages.
train.sh -d <location of preprocessed data> -train_set <leave as is> -train_index_file <leave as is> -store_model_path <location of where to save the trained model and dependencies> -t <name of tensorboard log folder> -iters <training iterations> -summary_step <how often to save a log of the training loss to tensorboard>
- vc/vae/infer.sh
- Used to perform conversions of unseen source speakers to a target speaker
- source_folder: Folder containing source speakers to convert
- target: location of specific .wav file for the target speaker. This is used for the conversion. Better quality = better conversion.
- inference.py arguments:
- -a: directory of attr.pkl file, should be in the preprocessed data folder
- -c: location of model.config.yaml file, should be in the trained model folder
- -m: location of model.ckpt file, should be in the trained model folder
- -s: don't change. It is set from source_folder
- -t: don't change. It is set from target
- -o: output directory. You have to create the directory.
infer.sh
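The effect of the segment_size and training_samples settings in the VAE preprocessing configuration described earlier can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: it assumes that each file's length is measured in the same units as segment_size, and that a training sample is a file name together with a random start position. The function name `sample_segments` and the `seed` parameter are hypothetical.

```python
import random


def sample_segments(file_lengths, segment_size, training_samples, seed=0):
    """Filter out short files, then draw random (file, start) training samples.

    file_lengths maps a file name to its length, measured in the same
    units as segment_size (an assumption of this sketch).
    """
    rng = random.Random(seed)
    # Keep only files long enough to hold one full segment.
    usable = {name: n for name, n in file_lengths.items() if n >= segment_size}
    if not usable:
        raise ValueError("no file is at least segment_size long")
    names = sorted(usable)
    samples = []
    for _ in range(training_samples):
        name = rng.choice(names)
        # randint is inclusive on both ends, so the segment always fits.
        start = rng.randint(0, usable[name] - segment_size)
        samples.append((name, start))
    return samples
```

With a larger segment_size more short recordings are dropped, so fewer distinct files contribute to the training_samples that are drawn.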