GAN_mapping_relationship

This is the implementation of our paper, which proposes an unsupervised phoneme recognition system that achieves 36% phoneme accuracy on TIMIT with oracle phone boundaries. The method uses a GAN-based model to perform phoneme recognition without supervision.

How to use

Dependencies

tensorflow 1.13
kaldi
librosa

Data preprocessing

Usage:

Set your Kaldi path in path.sh.
Set your feature path and TIMIT path in config.sh.
Run $ bash preprocess.sh

Phoneme sequences can be downloaded from here; put fake.39 and oracle.39 in ./data.
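Before training, it can help to confirm that both phoneme sequence files are where the scripts expect them. The check below is a minimal sketch, not part of the repository: the helper name `missing_phoneme_files` is hypothetical, and only the file names and the ./data directory come from the text above.

```python
from pathlib import Path

# File names and the default directory are taken from the README;
# the helper itself is illustrative and not part of the repository.
REQUIRED_FILES = ("fake.39", "oracle.39")


def missing_phoneme_files(data_dir="./data"):
    """Return the required phoneme sequence files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_phoneme_files()
    if missing:
        print("Missing phoneme sequence files:", ", ".join(missing))
    else:
        print("All phoneme sequence files are in place.")
```

An empty list means both files were found.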

Train model

Usage:

Modify the experiment and path settings in config.sh.
Modify the model parameters in src/audio2vec.sh and src/mapping.sh.
Run $ bash run.sh

This script runs the training flow of the whole system.

Hyperparameters in `config.sh`

cluster_num : number of clusters.

target_type : type of phoneme sequences (oracle/fake).

Hyperparameters in `src/audio2vec.sh`

mode : train or test mode (train/test); test mode performs the clustering step.

lr : learning rate.

max_length : maximum length of an acoustic token.

hidden_units : hidden size of the audio2vec.

batch_size : batch size.

epoch : number of training epochs.

kl_saturate : parameter of KL annealing.

kl_step : parameter of KL annealing; the number of steps over which the KL weight increases from 0 to 1.

cuda_id : GPU ids.
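To illustrate what kl_step controls, here is a minimal sketch of a KL-annealing schedule. It assumes the weight rises linearly from 0 to 1; the actual shape of the schedule in this repository may differ, and kl_saturate is not modelled because its exact role is not described here. The function names `kl_weight` and `annealed_loss` are hypothetical.

```python
def kl_weight(step, kl_step):
    """Linearly anneal the KL weight from 0 to 1 over kl_step training steps.

    The linear shape is an assumption made for illustration.
    """
    if kl_step <= 0:
        return 1.0  # no annealing requested: use the full KL term
    return min(1.0, max(0.0, step / kl_step))


def annealed_loss(reconstruction_loss, kl_loss, step, kl_step):
    """Combine a reconstruction term and a KL term using the annealed weight."""
    return reconstruction_loss + kl_weight(step, kl_step) * kl_loss


if __name__ == "__main__":
    for step in (0, 250, 500, 1000, 2000):
        print(step, kl_weight(step, kl_step=1000))
```

With a small weight early in training, the model can first learn to reconstruct its input before the KL term is applied at full strength.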

Hyperparameters in `src/mapping.sh`

mode : train or test mode (train/test).

generator_lr : learning rate of generator.

discriminator_lr : learning rate of discriminator.

max_length : max length of phoneme sequence.

step : number of training steps.

discriminator_hidden_units : hidden size of the discriminator.

discriminator_iterations : number of training iterations of the discriminator.

batch_size : batch size.

cuda_id : GPU ids.
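The way step and discriminator_iterations might interact can be sketched as an update schedule. This assumes the common GAN recipe of several discriminator updates followed by one generator update per training step; that pairing is an assumption, not something confirmed by this repository, and `update_schedule` is a hypothetical helper.

```python
def update_schedule(step, discriminator_iterations):
    """Yield (training_step, network) pairs in the order the updates would run.

    Assumes each training step consists of discriminator_iterations
    discriminator updates followed by a single generator update.
    """
    for training_step in range(step):
        # Update the discriminator several times with the generator held fixed.
        for _ in range(discriminator_iterations):
            yield training_step, "discriminator"
        # Then update the generator once against the refreshed discriminator.
        yield training_step, "generator"


if __name__ == "__main__":
    for training_step, network in update_schedule(step=2, discriminator_iterations=3):
        print(training_step, network)
```

Under this assumption, a larger discriminator_iterations gives the discriminator more updates per generator update, at the cost of longer training steps.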

Reference

Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings, Da-Rong Liu, Kuan-Yu Chen, et al.

Acknowledgement

Special thanks to Da-Rong Liu!