This is the repository for the Textless Phoneme Aligner with Common Phone, a gender-balanced, multilingual corpus recorded from more than 11,000 contributors via Mozilla's Common Voice project. The corpus comprises around 116 hours of speech enriched with automatically generated phonetic segmentation.
The PyTorch dataset reads the metadata row by row, loads the file at each row's filepath as an audio array (input_values), and returns the audio array together with the row's other fields in a dictionary.
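The row-to-dictionary behaviour described above can be sketched as follows. This is a minimal, dependency-free illustration and not the repository's actual dataset class: the class name `MetadataAudioDataset`, the CSV metadata format, the `filepath` column name, and the restriction to 16-bit mono WAV files are all assumptions. In real use one would typically subclass `torch.utils.data.Dataset` and load audio with a dedicated audio library.

```python
import csv
import struct
import wave
from pathlib import Path


class MetadataAudioDataset:
    """Map-style dataset sketch: one metadata row -> one dictionary."""

    def __init__(self, metadata_csv, audio_root="."):
        self.audio_root = Path(audio_root)
        # Read all metadata rows up front; each row is a dict keyed by column name.
        with open(metadata_csv, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f))

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = dict(self.rows[idx])  # copy so the cached row is not mutated
        path = self.audio_root / row["filepath"]  # hypothetical column name
        with wave.open(str(path), "rb") as wav:
            if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
                raise ValueError("this sketch only handles 16-bit mono WAV")
            frames = wav.readframes(wav.getnframes())
            sampling_rate = wav.getframerate()
        # Decode little-endian int16 samples and scale them to [-1.0, 1.0).
        samples = struct.unpack(f"<{len(frames) // 2}h", frames)
        row["input_values"] = [s / 32768.0 for s in samples]
        row["sampling_rate"] = sampling_rate
        return row
```

Any object that implements `__len__` and `__getitem__` in this way follows the map-style dataset protocol, so each item carries both the audio array and the remaining metadata fields.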
- Run the command pip install -r requirements.txt
- Install apex following this issue
Download the dataset from this link and extract it to the current directory.
After downloading the dataset from the link above, run the following commands to process the downloaded data:
python metadata_processing.py
python etl.py --file METADATA
where:
- metadata_processing.py: assumes the extracted folder is located in the current directory
- METADATA: the metadata generated by the first command
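The two preprocessing steps can also be chained from Python. The sketch below is only an illustration: the helper name `preprocessing_commands` and the placeholder path `metadata.csv` are hypothetical, and the real name of the file written by metadata_processing.py should be substituted for it.

```python
import subprocess
import sys


def preprocessing_commands(metadata_path):
    """Return the two preprocessing commands as argument lists, in order."""
    return [
        [sys.executable, "metadata_processing.py"],
        [sys.executable, "etl.py", "--file", metadata_path],
    ]


def run_preprocessing(metadata_path):
    """Run both steps in order, stopping at the first failure."""
    for cmd in preprocessing_commands(metadata_path):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "metadata.csv" is a placeholder, not the actual generated file name.
    run_preprocessing("metadata.csv")
```

Using `sys.executable` keeps both steps in the same Python environment in which the requirements were installed.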
Run the following command:
python hf_train.py --arg ARG
where arg and ARG can be any key-value pair in segmentation_config.json and training_config.json. In addition, the following arguments must be specified:
- --datadir: the data directory of Common Phone
- --output_data_dir: the output directory
- --action: use train
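One way a script can accept any key from a JSON configuration as a command-line override is to generate one option per key. The sketch below illustrates that pattern under stated assumptions and is not the actual implementation of hf_train.py: the function `build_parser` and the example keys `learning_rate`, `fp16`, and `num_labels` are hypothetical, and in practice the dictionaries would be loaded from the two JSON files with `json.load`.

```python
import argparse


def parse_bool(text):
    """Parse booleans explicitly, because bool("false") would be True."""
    return text.lower() in ("1", "true", "yes")


def build_parser(*configs):
    """Build a parser with three required options plus one option per config key."""
    parser = argparse.ArgumentParser(description="config-driven CLI sketch")
    parser.add_argument("--datadir", required=True)
    parser.add_argument("--output_data_dir", required=True)
    parser.add_argument("--action", required=True)
    for config in configs:
        # Assumes keys are unique across configs; argparse rejects duplicates.
        for key, default in config.items():
            if isinstance(default, bool):  # check bool before int
                caster = parse_bool
            elif isinstance(default, (int, float, str)):
                caster = type(default)
            else:
                continue  # nested or null values are not exposed in this sketch
            parser.add_argument(f"--{key}", type=caster, default=default)
    return parser


if __name__ == "__main__":
    # Hypothetical config contents standing in for the two JSON files.
    training = {"learning_rate": 1e-4, "fp16": False}
    segmentation = {"num_labels": 10}
    print(build_parser(training, segmentation).parse_args())
```

With this pattern, values given on the command line replace the defaults read from the configuration, and keys that are not passed keep their configured values.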
To train with AWS SageMaker Training, you can follow the official instructions and use sagemaker_entry.py as the entry_point.