Note: This is an updated version of the recipe at https://github.com/kamperh/recipe_bucktsong_awe. The code here uses Python 3 (instead of Python 2.7) and LibROSA for feature extraction (instead of HTK). Because of slight differences in the resulting features, the results here do not exactly match those in the paper below, which was produced with the older recipe.
Unsupervised acoustic word embedding (AWE) approaches are implemented and evaluated on the Buckeye English and NCHLT Xitsonga speech datasets. The experiments are described in:
- H. Kamper, "Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models," in Proc. ICASSP, 2019. [arXiv]
Please cite this paper if you use the code.
The code provided here is not pretty, but I believe that research should be reproducible. I provide no guarantees with the code, but please let me know if you have any problems, find bugs, or have general comments.
The complete Buckeye English corpus and a portion of the NCHLT Xitsonga corpus are used. These can be downloaded from:
- Buckeye corpus: buckeyecorpus.osu.edu
- NCHLT Xitsonga portion: www.zerospeech.com. This requires registration for the challenge.
From the complete Buckeye corpus we split off several subsets: the sets labelled as devpart1 and zs respectively correspond to the English1 and English2 sets in Kamper et al., 2016. We use the Xitsonga dataset provided as part of the Zero Speech Challenge 2015 (a subset of the NCHLT data).
This recipe provides a Dockerfile for building a Docker image containing all the required dependencies. The recipe can be run without Docker, but then the dependencies need to be installed separately (see below). To use the Docker image, you need to:
- Install Docker and follow the post-installation steps.
- Install nvidia-docker.
To build the Docker image, run:
cd docker
docker build -f Dockerfile.gpu -t py3_tf1.13 .
cd ..
The remaining steps in this recipe can be run in a container in interactive mode. The dataset directories will also need to be mounted. To run a container in interactive mode with the mounted directories, run:
docker run --runtime=nvidia -it --rm -u $(id -u):$(id -g) -p 8887:8887 \
-v /r2d2/backup/endgame/datasets/buckeye:/data/buckeye \
-v /r2d2/backup/endgame/datasets/zrsc2015/xitsonga_wavs:/data/xitsonga_wavs \
-v "$(pwd)":/home \
py3_tf1.13
Alternatively, run ./docker.sh, which executes the above command and starts an interactive container. To directly start a Jupyter notebook in a container, run ./docker_notebook.sh and open http://localhost:8889/.
If you are not using Docker, the dependencies need to be installed separately; see docker/Dockerfile.gpu for the full list (it includes Python 3, TensorFlow 1.13, LibROSA and speech_dtw).
To install speech_dtw, clone the required GitHub repositories into ../src/ and compile the code as follows:
mkdir ../src/  # not necessary when using Docker
git clone https://github.com/kamperh/speech_dtw.git ../src/speech_dtw/
cd ../src/speech_dtw
make
make test
cd -
Update the paths in paths.py to point to the datasets. If you are using Docker, paths.py will already point to the mounted directories.

Extract MFCC and filterbank features in the features/ directory as follows:
cd features
./extract_features_buckeye.py
./extract_features_xitsonga.py
More details on the feature file formats are given in features/readme.md.
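The extraction scripts above handle everything; purely to illustrate the LibROSA-based approach mentioned at the top of this readme, here is a minimal sketch of MFCC extraction. The filename, sample rate, frame settings and output format are assumptions for the example, not necessarily what the scripts use (see the scripts and features/readme.md for the actual details).

```python
import librosa
import numpy as np

# Hypothetical example: 13 MFCCs with deltas and delta-deltas, giving
# 39-dimensional frames. The 25 ms window / 10 ms shift at 16 kHz are
# common defaults, not necessarily the recipe's exact settings.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
features = np.vstack(
    [mfcc, librosa.feature.delta(mfcc), librosa.feature.delta(mfcc, order=2)]
    ).T  # (n_frames, 39)
np.savez("mfcc_example.npz", utterance=features)
```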
This step is optional. To perform frame-level same-different evaluation based on dynamic time warping (DTW), follow samediff/readme.md.
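As background: the same-different task ranks pairs of spoken word segments by their DTW alignment cost, and pairs of the same word type should come out closer. The recipe uses the compiled speech_dtw package for speed; the sketch below is just a slow pure-NumPy rendering of a duration-normalised cosine-distance DTW cost to make the quantity concrete (the function and its exact normalisation are mine, not necessarily speech_dtw's).

```python
import numpy as np

def dtw_cost(x, y):
    """Duration-normalised DTW alignment cost between feature sequences
    x (n, d) and y (m, d) under the cosine distance. A minimal
    illustration; the recipe itself uses speech_dtw."""
    n, m = x.shape[0], y.shape[0]
    # Pairwise cosine distances between all frames of the two sequences.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    dist = 1.0 - xn @ yn.T
    # Standard DTW recursion over the accumulated-cost matrix.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1]
                )
    return D[n, m] / (n + m)  # normalise by a bound on the path length
```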
Extract and evaluate downsampled acoustic word embeddings by running the steps in downsample/readme.md.
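For orientation: a downsampled acoustic word embedding represents a variable-length segment with a fixed-dimensional vector by keeping a few evenly spaced frames and concatenating them. The sketch below is a minimal version under that reading; the number of frames and the interpolation scheme used in downsample/readme.md may differ.

```python
import numpy as np

def downsample_embedding(features, n_keep=10):
    """Embed a variable-length feature sequence (n_frames, dim) by
    keeping n_keep evenly spaced frames and flattening them. A minimal
    sketch; see downsample/readme.md for the recipe's exact procedure."""
    n_frames = features.shape[0]
    indices = np.linspace(0, n_frames - 1, n_keep).astype(int)
    return features[indices].flatten()  # (n_keep * dim,)
```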
Train and evaluate neural network acoustic word embedding models by running the steps in embeddings/readme.md.
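The actual models and training objectives are specified in embeddings/readme.md. As a rough sketch of the shared idea, the snippet below builds an RNN encoder in TensorFlow 1.13 (the version in the Docker image) that maps a padded batch of feature sequences to fixed-dimensional embeddings; the layer sizes are illustrative placeholders, and the decoder and training loss are omitted.

```python
import tensorflow as tf  # TensorFlow 1.13, as in the Docker image

# Batch of padded feature sequences: (batch, max_frames, feature_dim).
x = tf.placeholder(tf.float32, [None, None, 39])
lengths = tf.placeholder(tf.int32, [None])  # true lengths before padding

# Encoder RNN; the final state summarises the whole segment.
cell = tf.nn.rnn_cell.GRUCell(400)  # hidden size is a placeholder choice
_, final_state = tf.nn.dynamic_rnn(
    cell, x, sequence_length=lengths, dtype=tf.float32
    )

# Project to the acoustic word embedding (dimensionality is illustrative).
embedding = tf.layers.dense(final_state, 130)
```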
Some notebooks used during development are given in the notebooks/ directory. Note that these were used mainly for debugging and exploration, so they are not polished. A Docker container can be used to launch a notebook session by running ./docker_notebook.sh and then opening http://localhost:8889/.
In the root project directory, run make test to run unit tests.
The code is distributed under the Creative Commons Attribution-ShareAlike license (CC BY-SA 4.0).