/LibriPhrase

Recipe for LibriPhrase

Primary LanguagePythonMIT LicenseMIT

Recipe for LibriPhrase

πŸ“’ [09/02/23] We released an official test dataset used in our paper. For the fair comparison, you may use this dataset.

About the dataset

"LibriPhrase" is an open-source dataset tailored for tasks like user-defined keyword spotting (UDKWS), open-vocabulary keyword spotting (OV-KWS), and spoken-term detection (STD). It is constructed based on the LibriSpeech corpus, and we adhere to the license terms of LibriSpeech.

In this recipe, we have taken advantage of exceptional tools and libraries, listed as follows:

  • Source : LibriSpeech (ASR corpus) [1]
  • Alignment : Montreal Forced Aligner [2][3]
  • String distance metric : Levenshtein distance [4]
  • Unit : Phoneme-level (extract phoneme information from G2P [5])

Examples of LibriPhrase

Anchor text Easy negative text Hard negative text
friend guard
comfort
superior
frind
rend
trend
the river every morning
town with
not occurred
the giver
the liver
the rigor
i mean to and be made
be a banner
no less than
i seen to
i mean you
we mean to
at the right time began the kissing and
rubbing two bits of
conseil and land spent
at the same time
at the one time
knew the right time

Data pipeline

./libriphrase.py
./utils.py
./run.sh
./requirements.txt
./data/
  β”œβ”€β”€ librispeech_clean_dev_all_utt.csv
  β”œβ”€β”€ ...
  β”œβ”€β”€ librispeech_other_train_500h_all_utt.csv
  β”œβ”€β”€ ...
  β”œβ”€β”€ libriphrase_diffspk_all_1word.csv
  β”œβ”€β”€ ...
  └── libriphrase_diffspk_all_4word.csv
  
/LibriSpeech_ASR_corpus/
  β”œβ”€β”€ dev-clean/
  β”œβ”€β”€ ...
  └── train-other-500/
       β”œβ”€β”€ BOOKS.TXT
       β”œβ”€β”€ CHAPTERS.TXT
       β”œβ”€β”€ ...
       └── train-other-500/
            β”œβ”€β”€ 1006/
            β”œβ”€β”€ ...
            └── 985/
                β”œβ”€β”€ 126224/       
                β”œβ”€β”€ ...
                └── 126228/
                    β”œβ”€β”€ 985-126228-0000.wav
                    ...
/LibriPhrase_diffspk_all/
  β”œβ”€β”€ dev-clean/
  β”œβ”€β”€ ...
  └── train-other-500/
       └── train-other-500/
            β”œβ”€β”€ 1006/
            β”œβ”€β”€ ...
            └── 985/
                β”œβ”€β”€ 126224/       
                β”œβ”€β”€ ...
                └── 126228/
                    β”œβ”€β”€ 985-126228-0004_1word_0.wav
                    ...

Getting started

Environment

This work is performed in this environment.

  • Python 3.6.9
  • Linux Ubuntu 18.04

1. Preparation

Before you start, make sure to prepare the LibriSpeech ASR corpus.
Once the download is complete, proceed to clone this repository and install the package dependencies using the following steps:

git clone https://github.com/gusrud1103/LibriPhrase.git
cd LibriPhrase
pip install -r requirements.txt

Afterwards, download alignment CSV files from Google Link and place the files into data folder.

mkdir data
cd data      # locate CSV files in this folder

2. Generating LibriPhrase

This recipe is designed to extract short phrases (consisting of 1~4 words) from the LibriSpeech dataset and organize them into three categories: anchor, positive, and negative.
By controlling the mode argument in the script, you can customize the dataset's difficulty (easy/hard) and speaker type (same/different).
The script will create a hierarchical folder structure based on both the LibriSpeech dataset and manual arguments (Please refer the data pipeline).

You can use this recipe to generate test dataset. (We will also release the recipe for the training set soon.)
To ensure a fair comparison with our paper, we offer the official test dataset.
We generated the test data from β€œtrain-other-500” of LibriSpeech.

./run.sh

or

python3 libriphrase.py --libripath 'your path(librispeech wav files)' --newpath 'new path(libriphrase wav files)' --wordalign './data/librispeech_other_train_500h_all_utt.csv' --output './data/testset_librispeech_other_train_500h_short_phrase.csv' --numpair 3 --maxspk 1611 --maxword 4 --mode 'diffspk_all'

Arguments

  • --libripath : Folder for LibriSpeech ASR corpus (wav files)
  • --newpath : Folder to save generated LibriPhrase wav files
  • --wordalign : Word alignment information for LibriSpeech ASR corpus (Download csv files to data folder)
  • --output : Output filename that containing the information about generated LibriPhrase in csv format
  • --numpair : The number of samples in each case
  • --maxspk : The number of speakers (for reducing computation)
  • --maxword : The maximum number of words to construct short phrase (Select integer from 1 to 4)
  • --mode : The mode for comparison type (samespk/diffspk denote the consistency of the speaker between anchor and comparison and easy/hard/all denote negative type [samespk_easy, diffspk_easy, diffspk_hard, diffspk_all]

3. Results

Columns:

  • anchor : The file path of the anchor wav file
  • anchor_spk : The speaker of the anchor wav file
  • anchor_text : The text of the anchor wav file
  • anchor_dur: The duration of the anchor wav file
  • comparison: The file path of the comparison wav file
  • comparison_spk : The speaker of the comparison wav file
  • comparison_text : The text of the comparison wav file
  • comparison_dur : The duration of the comparison wav file
  • type : The category of the comparison (it depends on mode, so if mode is samespk_easy, then samespk_positive, samespk_easyneg are showed in the type column.)
mode type category
samespk_easy samespk_positive, samespk_easyneg
diffspk_easy diffspk_positive, diffspk_easyneg
diffspk_hard diffspk_positive, diffspk_hardneg
diffspk_all diffspk_positive, diffspk_easyneg, diffspk_hardneg
  • target : If a test sample (comparison) matches with the given anchor (text), the value is 1, otherwise 0.
  • class : The number of words in the phrase

Reference

[1] Vassil Panayotov, Guoguo Chen, Daniel Povey, and San-jeev Khudanpur, β€œLibrispeech: an asr corpus based onpublic domain audio books,” in ICASSP, 2015.
[2] Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio, "Speech Model Pre-training for End-to-End Spoken Language Understanding", INTERSPEECH 2019.
[3] Michael McAuliffe, Michaela Socolof, Sarah Mihuc,Michael Wagner, and Morgan Sonderegger, β€œMontreal forced aligner: Trainable text-speech alignment using kaldi.,” in INTERSPEECH, 2017.
[4] Vladimir I Levenshtein et al., β€œBinary codes capable of correcting deletions, insertions, and reversals,” in Soviet physics doklady. Soviet Union, 1966, vol. 10, pp. 707–710.
[5] Jongseok Park, Kyubyong Kim, β€œg2pe,”https://github.com/Kyubyong/g2p, 2019.

License

Distributed under the MIT License. See LICENSE for more information.

Citation

If you use this code, please cite:

@inproceedings{Shin22LibriPhrase,
  author={Hyeon-Kyeong Shin and Hyewon Han and Doyeon Kim and Soo-Whan Chung and Hong-Goo Kang},
  title={Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting},
  year=2022,
  booktitle={INTERSPEECH},
  pages={1871--1875},
  doi={10.21437/Interspeech.2022-580}
}