Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT

This is the official repository of the IEEE SLT 2024 paper "Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT".

Setup

conda create -y -n py310 python=3.10.14 pip=24.0
conda activate py310
pip install -r requirements/requirements.txt

sh scripts/setup.sh

Usage: encoding waveforms into pseudo-syllabic units

import torchaudio

from src.speaker_disentangled_hubert import BYOLForSyllableDiscovery

wav_path = "/path/to/wav"

# download a pretrained model from the Hugging Face Hub
model = BYOLForSyllableDiscovery.from_hf_hub().cuda()

# load a waveform
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode a waveform into pseudo-syllabic units
outputs = model(waveform.cuda())

# pseudo-syllabic units
units = outputs["units"]  # e.g. [3950, 67, ..., 503]
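
The snippet above resamples every input to 16 kHz before encoding. As a minimal sketch, assuming the input is an uncompressed PCM WAV file, the standard-library `wave` module can read the header first to see whether a file already matches that rate. The names `WavInfo`, `describe_wav`, and `TARGET_SR` are illustrative and not part of this repository.

```python
import wave
from dataclasses import dataclass

TARGET_SR = 16000  # sample rate used in the usage snippet above


@dataclass
class WavInfo:
    sample_rate: int
    num_channels: int
    num_frames: int

    @property
    def duration_sec(self) -> float:
        # Length of the recording in seconds
        return self.num_frames / self.sample_rate

    @property
    def needs_resampling(self) -> bool:
        # True when the file is not already at the target rate
        return self.sample_rate != TARGET_SR


def describe_wav(path: str) -> WavInfo:
    """Read the header of a PCM WAV file without decoding the audio."""
    with wave.open(path, "rb") as f:
        return WavInfo(f.getframerate(), f.getnchannels(), f.getnframes())
```

Calling `describe_wav(wav_path).needs_resampling` before loading gives a quick check on the input; the `torchaudio.functional.resample` call in the snippet remains the step that actually converts the audio.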

Demo

A Google Colab demo is available here.

Models

Download models from the following links.

Model	Link
Speaker-disentangled HuBERT	download
KMeans	download
Agglomerative clustering	download

Other models can be downloaded from Hugging Face.

Data Preparation

If you already have LibriSpeech, you can use it by editing a config file;

dataset:
  root: "/path/to/LibriSpeech/root" # ${dataset.root}/LibriSpeech/train-clean-100, train-clean-360, ...

otherwise, you can download a fresh copy under dataset_root.

dataset_root=data  # be consistent with dataset.root in a config file

sh scripts/download_librispeech.sh ${dataset_root}

Check the directory structure:

dataset.root in a config file
└── LibriSpeech/
    ├── train-clean-100/
    ├── train-clean-360/
    ├── train-other-500/
    ├── dev-clean/
    ├── dev-other/
    ├── test-clean/
    ├── test-other/
    └── SPEAKERS.TXT
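
The layout above can also be checked programmatically. The following is a rough sketch, assuming the dataset root is laid out exactly as in the tree; the helper `missing_librispeech_entries` and the list `EXPECTED_ENTRIES` are hypothetical names and do not exist in this repository.

```python
from pathlib import Path

# Entries listed in the directory tree above
EXPECTED_ENTRIES = [
    "train-clean-100",
    "train-clean-360",
    "train-other-500",
    "dev-clean",
    "dev-other",
    "test-clean",
    "test-other",
    "SPEAKERS.TXT",
]


def missing_librispeech_entries(dataset_root: str) -> list[str]:
    """Return the expected entries that are absent under <dataset_root>/LibriSpeech."""
    base = Path(dataset_root) / "LibriSpeech"
    return [name for name in EXPECTED_ENTRIES if not (base / name).exists()]
```

For example, `missing_librispeech_entries("data")` returns an empty list when every subset and `SPEAKERS.TXT` are in place, and otherwise lists what still needs to be downloaded.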

Training & Evaluation

python main.py --config configs/default.yaml

Citation

@inproceedings{Komatsu_Self-Supervised_Syllable_Discovery_2024,
  author = {Komatsu, Ryota and Shinozaki, Takahiro},
  title = {Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT},
  year = {2024},
  month = {Dec.},
  booktitle = {IEEE Spoken Language Technology Workshop},
  pages = {},
}
