Textless Phoneme Aligner with Multilingual Voice Dataset

This is the repository for the Textless Phoneme Aligner, which uses Common Phone, a gender-balanced, multilingual corpus recorded by more than 11,000 contributors via Mozilla's Common Voice project. The corpus comprises around 116 hours of speech enriched with automatically generated phonetic segmentation.

The PyTorch dataset reads the metadata row by row, loads the audio file at each row's filepath as an audio array (input_values), and returns that array together with the row's other fields in a dictionary.
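The row-to-dictionary behaviour described above can be sketched as follows. This is a minimal illustration, not the repository's actual class: the name MetadataAudioDataset, the load_audio callable, and the "filepath" column name are assumptions made for the example. The class only implements __len__ and __getitem__, which is the protocol PyTorch expects from a map-style dataset.

```python
from typing import Any, Callable, Dict, List, Sequence


class MetadataAudioDataset:
    """Hypothetical map-style dataset: one metadata row in, one dictionary out."""

    def __init__(
        self,
        rows: Sequence[Dict[str, Any]],
        load_audio: Callable[[str], List[float]],
        path_column: str = "filepath",  # assumed column name
    ) -> None:
        self.rows = list(rows)
        self.load_audio = load_audio
        self.path_column = path_column

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        # Copy the row so the stored metadata is never mutated.
        item = dict(self.rows[index])
        # Decode the audio lazily, only when this row is requested.
        item["input_values"] = self.load_audio(item[self.path_column])
        return item
```

In practice, rows could come from csv.DictReader over the metadata file, and load_audio could wrap whichever audio library the project uses. Passing the loader in as a callable keeps the sketch independent of that choice.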

Setup

Code

Run pip install -r requirements.txt
Install apex by following this issue

Data

Download the dataset from this link and extract the archive into the current directory.

Run

Processing data with raw source

After downloading the dataset from the link above, run the following commands to process it:

python metadata_processing.py
python etl.py --file METADATA

where:

metadata_processing.py assumes the extracted folder is located in the current directory
METADATA: the metadata file generated by the first command

Experimentation

Run the following command:

python hf_train.py --arg ARG

where arg and ARG can be any key-value pair from segmentation_config.json or training_config.json. In addition, the following arguments must be specified:

--datadir: the Common Phone data directory
--output_data_dir: the output directory
--action: use train
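One common way to let any config key be overridden from the command line is to register one argparse option per key, with the config value as the default. The sketch below illustrates that pattern under stated assumptions. It is not taken from hf_train.py, the function names build_parser and _converter are hypothetical, and the config keys used in the usage example are made up.

```python
import argparse
import json
from typing import Any, Callable, Dict

# The three arguments the README lists as required.
REQUIRED = ("datadir", "output_data_dir", "action")


def _converter(default: Any) -> Callable[[str], Any]:
    """Pick a parser for a command-line value based on the config default."""
    if isinstance(default, bool):  # checked before int: bool is an int subclass
        return lambda text: text.lower() in ("1", "true", "yes")
    if isinstance(default, (int, float, str)):
        return type(default)
    return json.loads  # lists, dicts, and null defaults are passed as JSON text


def build_parser(config: Dict[str, Any]) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    for name in REQUIRED:
        parser.add_argument(f"--{name}", required=True)
    for key, default in config.items():
        if key in REQUIRED:
            continue  # avoid registering the same option twice
        parser.add_argument(f"--{key}", type=_converter(default), default=default)
    return parser
```

For example, with a hypothetical config of {"epochs": 3, "learning_rate": 1e-4}, calling build_parser(config).parse_args([...]) with --epochs 5 overrides the epoch count while learning_rate keeps its config value.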

To train with AWS SageMaker Training, follow the official instructions and use sagemaker_entry.py as the entry_point.