Note: An updated version of this recipe is available at https://github.com/kamperh/recipe_bucktsong_awe_py3. This updated recipe is implemented in Python 3 (instead of Python 2.7) and uses LibROSA for feature extraction (instead of HTK).
Unsupervised acoustic word embedding (AWE) approaches are implemented and evaluated on the Buckeye English and NCHLT Xitsonga speech datasets. The experiments are described in:
- H. Kamper, "Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models," in Proc. ICASSP, 2019. [arXiv]
Please cite this paper if you use the code.
The code provided here is not pretty. But I believe that research should be reproducible. I provide no guarantees with the code, but please let me know if you have any problems, find bugs or have general comments.
Portions of the Buckeye English and NCHLT Xitsonga corpora are used. The complete Buckeye corpus is required to execute the steps here, along with the Xitsonga portion of the NCHLT data. These can be downloaded from:
- Buckeye corpus: buckeyecorpus.osu.edu
- NCHLT Xitsonga portion: www.zerospeech.com. This requires registration for the challenge.
From the complete Buckeye corpus we split off several subsets. The most important are the sets labelled devpart1 and zs in the code here. These sets correspond respectively to English1 and English2 in Kamper et al., 2016, so see that paper for more details. Details of which speakers are found in which set are also given at the end of features/readme.md. We use the entire Xitsonga dataset provided as part of the Zero Speech Challenge 2015 (this is already a subset of the NCHLT data).
Download all of these datasets beforehand; they can be stored separately from the code.
Clone the repository by running:
git clone https://github.com/kamperh/bucktsong_awe
Move into the repository directory:
cd bucktsong_awe
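After cloning, a quick listing confirms you are in the right place; the directory and script names below are the ones referenced throughout this readme (the listing is not exhaustive):
ls
# docker/  downsample/  embeddings/  features/  notebooks/  samediff/
# docker.sh  docker_notebook.sh  paths.py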
This recipe comes with Dockerfiles that can be used to build images containing all of the required dependencies. The recipe can be completed without Docker, but using the image makes it easier to resolve dependencies. To use the docker image, you first need to:
- Install Docker and follow the post installation steps.
- Install nvidia-docker.
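A quick way to confirm that Docker and the NVIDIA runtime are working (the CUDA image tag below is just an example, not a requirement of this recipe):
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi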
The one dependency needed for building the image is HTK. Download the file HTK-3.4.1.tar.gz from their website and copy it into the docker directory.
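For example, if the registered download ended up in your downloads directory (the source path is illustrative):
cp ~/Downloads/HTK-3.4.1.tar.gz docker/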
Then, to build a docker image, run the following:
cd docker
docker build -f Dockerfile.gpu -t tf-htk .
cd ..
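If the build succeeded, the image should show up when listing images:
docker images tf-htk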
All of the remaining steps can be run in a container in interactive mode. You will need to mount the dataset directories into the container; to do so, run:
docker run --runtime=nvidia \
-v /r2d2/backup/endgame/datasets/buckeye:/data/buckeye \
-v /r2d2/backup/endgame/datasets/zrsc2015/xitsonga_wavs:/data/xitsonga_wavs \
-v "$(pwd)":/home -it -p 8887:8887 tf-htk
Alternatively, simply run ./docker.sh, which executes the above command and starts an interactive container.
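If your datasets are stored elsewhere, the source side of the -v mounts (and hence docker.sh) needs to be adjusted accordingly, for example (placeholder paths):
docker run --runtime=nvidia \
    -v /path/to/buckeye:/data/buckeye \
    -v /path/to/xitsonga_wavs:/data/xitsonga_wavs \
    -v "$(pwd)":/home -it -p 8887:8887 tf-htk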
If you are not using the docker image, install all the standalone dependencies (see Dependencies section below). Then follow the steps here. The docker image includes all these dependencies and GitHub repositories.
Clone the required GitHub repositories into ../src/ as follows:
mkdir ../src/ # not necessary using docker
git clone https://github.com/kamperh/speech_dtw.git ../src/speech_dtw/
Build the speech_dtw tools by running:
cd ../src/speech_dtw
make
make test
cd -
For speech_dtw you need to run make to build. Unit tests can be performed by running make test. See the readmes for more details.
In the root project directory, run make test to run unit tests.
Update the paths in paths.py. If you are using docker, this file should already contain the mounted directories. Extract filterbank and MFCC features by moving to the features directory (cd features) and then running the steps in features/readme.md.
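Before extracting features, it can be worth checking that the data directories referenced in paths.py are reachable; the /data/... paths below are the docker mount targets from the command above:
ls /data/buckeye | head        # Buckeye audio data
ls /data/xitsonga_wavs | head  # Xitsonga wav files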
To perform frame-level same-different evaluation based on dynamic time warping (DTW), follow the steps in samediff/readme.md.
Extract and evaluate downsampled acoustic word embeddings by running the steps in downsample/readme.md.
Train and evaluate encoder-decoder recurrent neural network acoustic word embedding methods by running the steps in embeddings/readme.md.
Some example notebooks are given in the notebooks/ directory. Note that these were used mainly during development, so they are not completely refined. A docker container can be used to launch a notebook session by running ./docker_notebook.sh and then opening http://localhost:8889/.
Standalone packages:
- Python 2.7
- NumPy and SciPy
- HTK: Used for MFCC feature extraction.
- TensorFlow
Repositories from GitHub:
- speech_dtw: Used for same-different evaluation. Should be cloned into the directory ../src/speech_dtw/, as done in the Preliminary section above.
All of these dependencies are packaged in the docker images.
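For a setup without docker, a rough sketch of installing the Python packages is given below; the package names are the standard PyPI ones and versions are not pinned here, so treat this as a starting point rather than the exact environment used:
pip install numpy scipy
pip install tensorflow-gpu   # or tensorflow for CPU-only runs
# HTK is not on PyPI: download it from htk.eng.cam.ac.uk (registration required) and build it manually.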
The code is distributed under the Creative Commons Attribution-ShareAlike license (CC BY-SA 4.0).