Phoneme Boundary Detection using Learnable Segmental Features (ICASSP 2020)

This repository provides a PyTorch implementation of the paper Phoneme Boundary Detection using Learnable Segmental Features.

Paper

Phoneme Boundary Detection using Learnable Segmental Features
Felix Kreuk, Yaniv Sheena, Joseph Keshet, Yossi Adi
45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

Phoneme boundary detection is an essential first step for a variety of speech processing applications such as speaker diarization, speech science, keyword spotting, etc. In this work, we propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection. First, we evaluated our model when the spoken phonemes were not given as input. Results on the TIMIT and Buckeye corpora suggest that the proposed model is superior to the baseline models and reaches state-of-the-art performance in terms of F1 and R-value. We further explore the use of phonetic transcription as additional supervision and show this yields minor improvements in performance but substantially better convergence rates. We additionally evaluate the model on a Hebrew corpus and demonstrate such phonetic supervision can be beneficial in a multi-lingual setting.

If you find this implementation useful, please consider citing our work:

@article{kreuk2020phoneme,
  title={Phoneme Boundary Detection using Learnable Segmental Features},
  author={Kreuk, Felix and Sheena, Yaniv and Keshet, Joseph and Adi, Yossi},
  journal={arXiv preprint arXiv:2002.04992},
  year={2020}
}

Dependencies

loguru==0.4.1
boltons==20.0.0
pandas==1.0.0
pytorch-lightning==0.6.0
SoundFile==0.10.3.post1
test-tube==0.7.5
torch==1.4.0
torchaudio==0.4.0
torchvision==0.4.2
tqdm==4.42.1
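
To compare an existing environment against the pins above, something like the following could be used. This is only a sketch: the helpers `parse_pins`, `installed_version`, and `find_mismatches` are hypothetical names introduced here and are not part of the repository, and the script assumes Python 3.8 or newer for `importlib.metadata`.

```python
from importlib import metadata

# Pinned versions copied from the dependency list above.
PINS = """\
loguru==0.4.1
boltons==20.0.0
pandas==1.0.0
pytorch-lightning==0.6.0
SoundFile==0.10.3.post1
test-tube==0.7.5
torch==1.4.0
torchaudio==0.4.0
torchvision==0.4.2
tqdm==4.42.1
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (pinned, installed)} for every package that differs."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(parse_pins(PINS)).items():
        print(f"{name}: pinned {wanted}, installed {found or 'nothing'}")
```

An empty output means every pinned package is installed at the listed version.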

Usage

1. Clone the repository

git clone https://github.com/felixkreuk/SegFeat.git
cd SegFeat

2. Data structure

The dataloader in dataloader.py assumes the dataset is structured as follows:

timit_directory
│
└───val
│   │   X.wav
│   └─  X.phn
│
└───test
│   │   Y.wav
│   └─  Y.phn
│
└───train
    │   Z.wav
    └─  Z.phn

Here X.wav is a raw waveform signal, and X.phn contains its corresponding phoneme boundaries, labeled in the following format:

0 9640 h#
9640 11240 sh
11240 12783 iy
12783 14078 hv
14078 16157 ae
16157 16880 dcl
...

The two numbers on each line are the onset and offset of the phoneme (in samples), and the last element is the phoneme identity.
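
The layout and label format described above can be read with a few lines of Python. The sketch below is independent of dataloader.py and is not taken from it: `Segment`, `parse_phn`, `boundaries`, and `find_pairs` are hypothetical names used only for illustration, and the sketch assumes lowercase .wav and .phn extensions as shown in the directory tree.

```python
from pathlib import Path
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    onset: int    # first sample of the phoneme
    offset: int   # sample at which the phoneme ends
    phoneme: str  # phoneme identity, e.g. "sh"


def parse_phn(text: str) -> List[Segment]:
    """Parse the contents of a .phn file into a list of segments."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        onset, offset, phoneme = parts
        segments.append(Segment(int(onset), int(offset), phoneme))
    return segments


def boundaries(segments: List[Segment]) -> List[int]:
    """Return the sorted, de-duplicated boundary positions in samples."""
    points = set()
    for seg in segments:
        points.add(seg.onset)
        points.add(seg.offset)
    return sorted(points)


def find_pairs(split_dir) -> List[Tuple[Path, Path]]:
    """List (wav, phn) pairs in one split directory such as train/."""
    pairs = []
    for wav in sorted(Path(split_dir).glob("*.wav")):
        phn = wav.with_suffix(".phn")
        if phn.exists():
            pairs.append((wav, phn))
    return pairs


EXAMPLE = """\
0 9640 h#
9640 11240 sh
11240 12783 iy
12783 14078 hv
14078 16157 ae
16157 16880 dcl
"""

if __name__ == "__main__":
    segs = parse_phn(EXAMPLE)
    print(segs[1])
    print(boundaries(segs))
```

Calling `find_pairs("timit_directory/train")` would return only those waveforms that have a matching label file, which is a quick way to spot missing .phn files before training.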

3. Training

python main.py --wav_path /path/to/timit/dataset --dataset timit --delta_feats --dist_feats

If --load_ckpt /path/to/model.ckpt is given, training resumes from that checkpoint.
Testing begins once training finishes, either because the maximum number of epochs was reached or because early stopping was triggered.
For the full list of run arguments, see python main.py --help:

usage: main.py [-h] [--wav_path WAV_PATH] [--dataset {timit,buckeye}]
               [--run_dir RUN_DIR] [--exp_name EXP_NAME]
               [--load_ckpt LOAD_CKPT] [--gpus GPUS] [--devrun]
               [--devrun_size DEVRUN_SIZE] [--lr LR] [--optimizer OPTIMIZER]
               [--momentum MOMENTUM] [--epochs EPOCHS] [--batch_size N]
               [--dropout DROPOUT] [--seed SEED] [--patience PATIENCE]
               [--gamma GAMMA] [--overfit OVERFIT]
               [--val_percent_check VAL_PERCENT_CHECK]
               [--val_check_interval VAL_CHECK_INTERVAL]
               [--val_ratio VAL_RATIO] [--rnn_input_size RNN_INPUT_SIZE]
               [--rnn_hidden_size RNN_HIDDEN_SIZE] [--rnn_dropout RNN_DROPOUT]
               [--birnn] [--rnn_layers RNN_LAYERS]
               [--min_seg_size MIN_SEG_SIZE] [--max_seg_size MAX_SEG_SIZE]
               [--max_len MAX_LEN] [--feats {mfcc,mel,spect}] [--random_trim]
               [--delta_feats] [--dist_feats] [--normalize]
               [--bin_cls BIN_CLS] [--phn_cls PHN_CLS] [--n_fft N_FFT]
               [--hop_length HOP_LENGTH] [--n_mels N_MELS] [--n_mfcc N_MFCC]

segmentation

optional arguments:
  -h, --help            show this help message and exit
  --wav_path WAV_PATH
  --dataset {timit,buckeye}
  --run_dir RUN_DIR     directory for saving run outputs (logs, ckpt, etc.)
  --exp_name EXP_NAME   experiment name
  --load_ckpt LOAD_CKPT
                        path to a pre-trained model, if provided, training
                        will resume from that point
  --gpus GPUS
  --devrun              dev run on a dataset of size `devrun_size`
  --devrun_size DEVRUN_SIZE
                        size of dataset for dev run
  --lr LR               initial learning rate
  --optimizer OPTIMIZER
  --momentum MOMENTUM   momentum
  --epochs EPOCHS       upper epoch limit
  --batch_size N        batch size
  --dropout DROPOUT     dropout probability value
  --seed SEED           random seed
  --patience PATIENCE   patience for early stopping
  --gamma GAMMA         gamma margin
  --overfit OVERFIT     gamma margin
  --val_percent_check VAL_PERCENT_CHECK
                        how much of the validation set to check
  --val_check_interval VAL_CHECK_INTERVAL
                        validation check every K epochs
  --val_ratio VAL_RATIO
                        precentage of validation from train
  --rnn_input_size RNN_INPUT_SIZE
                        number of inputs
  --rnn_hidden_size RNN_HIDDEN_SIZE
                        RNN hidden layer size
  --rnn_dropout RNN_DROPOUT
                        dropout
  --birnn               BILSTM, if define will be biLSTM
  --rnn_layers RNN_LAYERS
                        number of lstm layers
  --min_seg_size MIN_SEG_SIZE
                        minimal size of segment, examples with segments
                        smaller than this will be ignored
  --max_seg_size MAX_SEG_SIZE
                        see `min_seg_size`
  --max_len MAX_LEN     maximal size of sequences
  --feats {mfcc,mel,spect}
                        type of acoustic features to use
  --random_trim         if this flag is on seuqences will be randomly trimmed
  --delta_feats         if this flag is on delta features will be added
  --dist_feats          if this flag is on the euclidean features will be
                        added (see paper)
  --normalize           flag to normalize features
  --bin_cls BIN_CLS     coefficient of binary classification loss
  --phn_cls PHN_CLS     coefficient of phoneme classification loss
  --n_fft N_FFT         n_fft for feature extraction
  --hop_length HOP_LENGTH
                        hop_length for feature extraction
  --n_mels N_MELS       number of mels
  --n_mfcc N_MFCC       number of mfccs
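
The .phn labels are expressed in samples, while acoustic features are computed once per hop of `hop_length` samples, so the two scales have to be related at some point. The sketch below shows one simple way to do that, assuming a frame index is obtained by integer division of the sample position by the hop length. This is an assumption for illustration only: the repository may round, pad, or offset differently, `samples_to_frames` is a hypothetical helper, and the hop length of 160 used in the example is an assumed value, not a documented default.

```python
def samples_to_frames(sample_boundaries, hop_length):
    """Map boundary positions in samples to feature-frame indices.

    Assumes frame i starts at sample i * hop_length (no centering or padding).
    Consecutive boundaries that fall into the same frame are collapsed.
    """
    if hop_length <= 0:
        raise ValueError("hop_length must be a positive integer")
    frames = []
    for sample in sample_boundaries:
        frame = sample // hop_length
        if not frames or frame != frames[-1]:
            frames.append(frame)
    return frames


if __name__ == "__main__":
    # Boundaries taken from the .phn example; hop_length=160 is an assumed value.
    print(samples_to_frames([0, 9640, 11240, 12783], hop_length=160))
```

A larger hop length gives coarser frames, so nearby boundaries are more likely to collapse into a single frame index.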

4. Testing

To run a test epoch, use the following command:

python main.py --wav_path /path/to/timit/ --dataset timit --delta_feats --dist_feats --load_ckpt segmentor.ckpt --test
