
Unsupervised segmentation and clustering of Buckeye English and NCHLT Xitsonga corpora.

Recipe: Segmentation and Clustering of Buckeye English and NCHLT Xitsonga



This is a recipe for unsupervised segmentation and clustering of subsets of the Buckeye English and NCHLT Xitsonga corpora. Details of the approach are given in Kamper et al., 2016:

  • H. Kamper, A. Jansen, and S. J. Goldwater, "A segmental framework for fully-unsupervised large-vocabulary speech recognition," arXiv preprint arXiv:1606.06950, 2016.

Please cite this paper if you use this code.

The recipe below makes use of the separate segmentalist package, which performs the actual unsupervised segmentation and clustering and was developed together with this recipe.


The code provided here is not pretty. But I believe that research should be reproducible, and I hope that this repository is sufficient to make this possible for the paper mentioned above. I provide no guarantees with the code, but please let me know if you have any problems, find bugs or have general comments.


Datasets

Portions of the Buckeye English and NCHLT Xitsonga corpora are used. The whole Buckeye corpus and a portion of the NCHLT data are required to execute the steps here. These can be downloaded from:

From the complete Buckeye corpus we split off several subsets. The most important are the sets labelled devpart1 and zs in the code here. These sets correspond to English1 and English2, respectively, in Kamper et al., 2016; see the paper for more details. Details of which speakers are found in which set are also given at the end of features/readme.md. We use the entire Xitsonga dataset provided as part of the Zero Speech Challenge 2015 (this was already a subset of the NCHLT data).


Preliminary

Obtain all the datasets as described in the Datasets section above. Install all the standalone dependencies (see the Dependencies section below). Then clone the required GitHub repositories into ../src/ as follows:

mkdir ../src/
git clone https://github.com/kamperh/segmentalist.git ../src/segmentalist/
git clone https://github.com/kamperh/speech_correspondence.git ../src/speech_correspondence/
git clone https://github.com/kamperh/speech_dtw.git ../src/speech_dtw/
git clone https://github.com/bootphon/tde.git ../src/tde

For both segmentalist and speech_dtw, you need to run make to build. Unit tests can be performed by running make test. See the readmes for more details.

The speech_correspondence and speech_dtw repositories are only necessary if you plan to do correspondence autoencoder (cAE) feature extraction. The speech_correspondence repository depends on Theano and Pylearn2, which are unnecessary if cAE features will not be used. The tde repository is only necessary if you plan to also calculate the evaluation metrics from the Zero Resource Speech Challenge 2015; without tde you will not be able to calculate the metrics in Section 4.5 of Kamper et al., 2016, but you will still be able to calculate the other metrics in the paper.
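To see at a glance which of the cloned repositories are in place before starting, something like the following could be used. This is only a sketch and not part of the recipe: the function name `check_repos` and the two dictionaries are hypothetical, and the directory names are taken from the clone commands above.

```python
import os

# Repository directories under ../src/, as used in the clone commands above.
# The grouping into required and optional follows the description in this readme.
REQUIRED = {"segmentalist": "segmentation and clustering"}
OPTIONAL = {
    "speech_correspondence": "cAE feature extraction",
    "speech_dtw": "cAE feature extraction",
    "tde": "Zero Resource Speech Challenge 2015 metrics",
}


def check_repos(src_dir="../src"):
    """Return a dict mapping each repository name to True if its directory exists."""
    names = list(REQUIRED) + list(OPTIONAL)
    return {name: os.path.isdir(os.path.join(src_dir, name)) for name in names}


if __name__ == "__main__":
    found = check_repos()
    for name in sorted(found):
        purpose = REQUIRED.get(name) or OPTIONAL[name]
        kind = "required" if name in REQUIRED else "optional"
        state = "found" if found[name] else "missing"
        print("%-22s %-8s %-8s (%s)" % (name, kind, state, purpose))
```

This only checks that the directories exist; it does not verify that make has been run or that tde has been set up.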

The tde package itself needs to be set up. In ../src/tde/, run the following:

python setup.py build_ext --inplace
python setup_freeze.py build_exe
python move_build.py english english_dir
python move_build.py xitsonga xitsonga_dir

Feature extraction

Some preprocessed resources are given in features/data/. Extract MFCC features by running the steps in features/readme.md. Some steps are optional depending on whether you intend to train a cAE (see below).

Correspondence autoencoder features (optional)

In Kamper et al., 2016 we compare both MFCCs and correspondence autoencoder (cAE) features as input to our system. It is not necessary to perform the steps below if you are happy with using MFCCs. The cAE was first introduced in this paper:

  • H. Kamper, M. Elsner, A. Jansen, and S. J. Goldwater, "Unsupervised neural network based feature extraction using weak top-down constraints," in Proc. ICASSP, 2015.

The cAE is trained on word pairs discovered using an unsupervised term discovery (UTD) system (based on the code available here). This UTD system does not form part of the repository here. Instead, the output word pairs discovered by the UTD system are provided as part of the repository in the following files:

  • English pairs: features/data/buckeye.fdlps.0.93.pairs
  • Xitsonga pairs: features/data/zs_tsonga.fdlps.0.925.pairs.v0

The MFCC features for these pairs were extracted as part of feature extraction (previous section).

To train the cAE, run the steps in cae/readme.md.

Unsupervised syllable boundary detection

We use the unsupervised syllable boundary detection algorithm described in:

  • O. J. Räsänen, G. Doyle, and M. C. Frank, "Unsupervised word discovery from speech using automatic segmentation into syllable-like units," in Proc. Interspeech, 2015.

Rather than packaging their code within our repository, we provide the output of their tools directly in syllables/landmarks/. All that remains is to extract subsets of Buckeye; run the following:

cd syllables
./get_landmarks_subset.py devpart1
./get_landmarks_subset.py zs

Downsampling: acoustic word embeddings

We use one of the simplest methods to obtain acoustic word embeddings: downsampling. We downsample both MFCC features and cAE features. Run the steps in downsample/readme.md.
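To give an idea of what downsampling means here, the sketch below maps a variable-length sequence of feature frames to a fixed-dimensional vector by keeping a fixed number of evenly spaced frames and concatenating them. It is an illustration under assumptions, not the code in downsample/: the function name `downsample`, the default of 10 frames, and the nearest-frame selection (rather than interpolation) are all choices made for this example, and plain lists are used so that it runs without NumPy.

```python
def downsample(frames, n=10):
    """Map a list of feature frames to one fixed-length vector.

    frames: list of equal-length lists (one list of coefficients per frame).
    n: number of frames to keep (an assumed default, not taken from the recipe).
    """
    if not frames or n < 1:
        raise ValueError("need at least one frame and n >= 1")
    last = len(frames) - 1
    if n == 1:
        indices = [last // 2]  # a single frame from the middle of the segment
    else:
        # n positions spread evenly from the first to the last frame,
        # each rounded to the nearest frame index.
        indices = [int(round(i * last / float(n - 1))) for i in range(n)]
    embedding = []
    for idx in indices:
        embedding.extend(frames[idx])  # concatenate the selected frames
    return embedding


if __name__ == "__main__":
    # Five toy frames of dimensionality 2, reduced to three frames.
    toy = [[float(t), 10.0 * t] for t in range(5)]
    print(len(downsample(toy, n=3)))  # 3 frames x 2 coefficients = 6
```

Segments shorter than n frames are handled by repeating frames, so every segment yields a vector of the same dimensionality (n times the frame dimensionality).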

Segmentalist: Unsupervised segmentation and clustering

Segmentation and clustering are performed using the segmentalist package. Run the steps in segmentation/readme.md.


Dependencies

Standalone packages:

  • Python
  • Cython: Used by the segmentalist and speech_dtw repositories below.
  • NumPy and SciPy.
  • HTK: Used for MFCC feature extraction.
  • Theano: Required by the speech_correspondence repository below.
  • Pylearn2: Required by the speech_correspondence repository below.

Repositories from GitHub:

  • segmentalist: This is the main segmentation software developed as part of this project. Should be cloned into the directory ../src/segmentalist/, as done in the Preliminary section above.
  • speech_correspondence: Used for correspondence autoencoder feature extraction. Should be cloned into the directory ../src/speech_correspondence/, as done in the Preliminary section above.
  • speech_dtw: Used for correspondence autoencoder feature extraction. Should be cloned into the directory ../src/speech_dtw/, as done in the Preliminary section above.
  • tde: The Zero Resource Speech Challenge evaluation tools. Should be cloned into the directory ../src/tde/, as done in the Preliminary section above.