This repository has 3 main purposes:
- Retrieving metadata from LibriVox (`scripts/get_librivox_overview.py`)
- Analysing the audio quality of retrieved recordings (WIP: `scripts/hifi_qa.py`)
- Creating datasets from selected audiobooks (`scripts/createDataset.py`)
Requirements:
- Linux
- Anaconda
Installation:
- Navigate to the cloned repository.
- Create and activate a new conda environment:
  ```
  conda create -n dataset_creation python=3.10
  conda activate dataset_creation
  ```
- Install the package as a development (editable) Python package:
  ```
  python setup.py develop
  ```
- Install the dependencies (NB: this repo currently uses the Gutenberg package, https://pypi.org/project/Gutenberg/, which requires you to install BSD-DB via your distribution's package manager):
  ```
  pip install -r requirements_complete_20240317.txt
  ```
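On Debian/Ubuntu, for example, installing the Berkeley DB (BSD-DB) headers might look like the following (the package names are an assumption and vary between distributions):

```
sudo apt-get install libdb-dev libdb++-dev
```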
Usage:
- Run `scripts/get_librivox_overview.py` to find speakers.
- Run `scripts/hifi_qa.py` to identify speakers with sufficient recording quality.
- Create a dataset config in `scripts/createDatasetConfig` (some examples are given).
- Modify `scripts/createDataset.py`: adjust `external_paths`, assign the dataset config path to a variable, and assign this variable to `all_configs` (this will be streamlined at some point); see the sketch after this list.
- Run `scripts/createDataset.py`; the script will likely crash in Step 3_1 (Prepare Text) and print the cases which were not normalized successfully. Currently, you need to modify the dataset config's `"text_replacement"` entry to account for these cases.
- If Step 5 (Align Text) fails, try manually removing the table of contents and any unread chapters from the text (in the folders for Step3 and Step3_1).
- Once all steps have finished, you will find the metadata under `<database_folder>/final_dataset/metadata.csv` and the clean dataset under `<database_folder>/final_dataset_clean` (no separate csv is created as of now).
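As a rough sketch of the modification described above (the names `external_paths` and `all_configs` come from the description; all keys, paths, and file names below are illustrative assumptions, so check `scripts/createDataset.py` for the actual structure):

```python
# Sketch only: adjust to the actual structure in scripts/createDataset.py.

# 1. Point external_paths at your local storage locations
#    (keys and paths here are hypothetical examples).
external_paths = {
    "database_folder": "/data/librivox_database",
    "download_folder": "/data/librivox_downloads",
}

# 2. Assign the path of your dataset config to a variable ...
my_book_config = "scripts/createDatasetConfig/my_book.json"

# 3. ... and hand that variable to all_configs so the pipeline picks it up.
all_configs = [my_book_config]
```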
Known issues:
- Step 3_1 (Prepare Text): needs manual book-specific text replacements (in the corresponding dataset config .json file); see the example after this list.
- Step 4_1 (Normalize Transcript): does not use the default text replacements but only the book-specific replacements.
- Step 4 (Transcript Audio): Whisper predicts digits for numbers instead of spelled-out words.
- Step 5 (Align Text): the current algorithm has problems with a lengthy table of contents at the beginning, and alignment for selected chapters of the text is not robust.
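To illustrate the book-specific replacements, a dataset config entry could look something like this (the `"text_replacement"` key is taken from the description above; its exact structure and the concrete replacements are assumptions for illustration):

```json
{
    "text_replacement": {
        "Chap. 1": "Chapter one",
        "1842": "eighteen forty-two"
    }
}
```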
Features:
- multilingual dataset creation (based on Whisper ASR)
- multilingual automated number normalization
- dataset creation for solo readings and chapter-wise readings
- audio splits based on sentence borders
Not yet released:
- multiple frequency band SNR (based on VAD)
- DeepXi SNR
- bandwidth analysis
- WVMOS integration
- room acoustics analysis
- (more detailed) F0 analysis
Acknowledgements:
This repository is largely based on the code of HUI Audio Corpus German - thanks for open-sourcing your code!
Further code acknowledgements:
- DeepXi (https://github.com/anicolson/DeepXi)
- WV-MOS (https://github.com/AndreevP/wvmos)
- NeMo (https://github.com/NVIDIA/NeMo-text-processing)
- Whisper (https://github.com/openai/whisper)
Inspiration from papers:
- Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang: Hi-Fi Multi-Speaker English TTS Dataset (https://arxiv.org/abs/2104.01497)
- Sewade Ogun, Vincent Colotte, Emmanuel Vincent: Can we use Common Voice to train a Multi-Speaker TTS system? (https://arxiv.org/abs/2210.06370)
Finally, thanks to the LibriVox community for providing an amazing public-domain resource.