Work-in-progress reproduction of the text-to-speech (TTS) model from the paper Natural language guidance of high-fidelity text-to-speech with synthetic annotations by Dan Lyth and Simon King, from Stability AI and the University of Edinburgh respectively.
Reproducing the TTS model requires the following five steps to be completed in order:
1. Train the Accent Classifier
2. Annotate the Training Set
3. Aggregate Statistics
4. Create Descriptions
5. Train the Model
The script `run_audio_classification.py` can be used to train an audio encoder model from the Transformers library (e.g. Wav2Vec2, MMS, or Whisper) for the accent classification task.
Starting from a pre-trained audio encoder, a simple linear classifier is appended to the last hidden layer to map the audio embeddings to class-label predictions. The audio encoder can either be frozen (`--freeze_base_model`) or trained. The linear classifier is randomly initialised, and is therefore always trained.
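Conceptually, the classification head is just a pooling step plus one linear projection on top of the encoder's hidden states. The sketch below illustrates this with random numpy arrays standing in for real encoder outputs; the shapes, mean-pooling strategy, and initialisation are illustrative assumptions, and the actual script uses the Transformers modelling classes rather than raw arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a frozen audio encoder's outputs: one hidden-state vector
# per time step, shape (time_steps, hidden_size).
time_steps, hidden_size, num_accents = 50, 1280, 16
hidden_states = rng.standard_normal((time_steps, hidden_size))

# The classification head: mean-pool over time, then a single linear layer
# projecting hidden_size -> num_accents. Only W and b are trained when the
# base model is frozen.
W = rng.standard_normal((hidden_size, num_accents)) * 0.01
b = np.zeros(num_accents)

pooled = hidden_states.mean(axis=0)   # (hidden_size,)
logits = pooled @ W + b               # (num_accents,)
predicted_accent = int(np.argmax(logits))
```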
The script can be used to train on a single accent dataset or a combination of datasets; to combine datasets, separate the dataset names, configs, and splits with the `+` character in the launch command (see below for an example).
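The `+`-separated arguments can be thought of as parallel lists that zip into one (name, config, split) triple per dataset. The helper below is a sketch of that parsing logic (the function name and error handling are assumptions, not the script's actual implementation):

```python
def parse_multi_dataset_args(names: str, configs: str, splits: str):
    """Split `+`-separated CLI strings into one (name, config, split)
    triple per dataset. All three strings must have the same number of
    entries, otherwise the pairing would be ambiguous."""
    names, configs, splits = (s.split("+") for s in (names, configs, splits))
    if not (len(names) == len(configs) == len(splits)):
        raise ValueError("dataset names, configs and splits must align")
    return list(zip(names, configs, splits))

# Mirrors the example launch command below:
triples = parse_multi_dataset_args(
    "vctk+facebook/voxpopuli+sanchit-gandhi/edacc",
    "main+en_accented+default",
    "train+test+validation",
)
# -> one (name, config, split) triple per training dataset
```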
In the following example, we follow Stability AI's approach: audio embeddings are taken from a frozen MMS-LID model, and the linear classifier is trained on a combination of three open-source datasets:
- The English-accented (`en_accented`) subset of VoxPopuli
- The train split of VCTK
- The dev split of EdAcc
The model is subsequently evaluated on the test split of EdAcc to give the final classification accuracy.
```bash
#!/usr/bin/env bash
python run_audio_classification.py \
    --model_name_or_path "facebook/mms-lid-126" \
    --train_dataset_name "vctk+facebook/voxpopuli+sanchit-gandhi/edacc" \
    --train_dataset_config_name "main+en_accented+default" \
    --train_split_name "train+test+validation" \
    --train_label_column_name "accent+accent+accent" \
    --eval_dataset_name "sanchit-gandhi/edacc" \
    --eval_dataset_config_name "default" \
    --eval_split_name "test" \
    --eval_label_column_name "accent" \
    --output_dir "./" \
    --do_train \
    --do_eval \
    --overwrite_output_dir \
    --remove_unused_columns False \
    --fp16 \
    --learning_rate 1e-4 \
    --max_length_seconds 20 \
    --attention_mask False \
    --warmup_ratio 0.1 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --preprocessing_num_workers 16 \
    --dataloader_num_workers 4 \
    --logging_strategy "steps" \
    --logging_steps 10 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --load_best_model_at_end True \
    --metric_for_best_model "accuracy" \
    --save_total_limit 3 \
    --freeze_base_model \
    --push_to_hub \
    --trust_remote_code
```
Tips:
- Number of labels: normalisation should be applied to the target class labels to group linguistically similar accents together (e.g. "Southern Irish" and "Irish" should both be mapped to "Irish"). This helps balance the dataset by removing labels with very few examples. You can modify the function `preprocess_labels` to implement any custom normalisation strategy.
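One simple normalisation strategy is a lookup table from fine-grained labels to accent groups. The sketch below shows the idea; the mapping and fallback behaviour are hypothetical, and the groups you choose should be adapted to the labels actually present in your datasets:

```python
# Hypothetical mapping from raw dataset labels to accent groups.
ACCENT_GROUPS = {
    "southern irish": "Irish",
    "northern irish": "Irish",
    "irish": "Irish",
    "scottish": "Scottish",
}

def normalise_accent(label: str) -> str:
    """Lower-case and trim the raw label, then collapse it into its accent
    group. Unknown labels are kept (title-cased) rather than dropped, so
    they can be reviewed and added to the table later."""
    key = label.strip().lower()
    return ACCENT_GROUPS.get(key, label.strip().title())
```

For example, `normalise_accent("Southern Irish")` and `normalise_accent("irish")` both yield `"Irish"` under this mapping.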
Annotate the training dataset with information on SNR, C50, pitch, and speaking rate.
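As an illustration of one such annotation, speaking rate can be measured as phonemes per second of audio. The helper below is a minimal sketch under that assumption (the paper's annotation pipeline uses dedicated tools for each measure, and its exact definition of speaking rate may differ):

```python
def speaking_rate(phonemes: str, duration_seconds: float) -> float:
    """Speaking rate as phonemes per second. Assumes `phonemes` is a
    space-separated phoneme string, e.g. from a phonemizer."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return len(phonemes.split()) / duration_seconds

# 8 phonemes spoken in 0.8 seconds -> 10 phonemes/second
rate = speaking_rate("HH AH L OW W ER L D", 0.8)
```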
Aggregate the statistics from Step 2 and convert the continuous values to discrete labels.
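Converting a continuous measurement to a discrete label amounts to bucketing it against threshold values derived from the aggregated statistics. A minimal sketch (the SNR thresholds and bucket names here are illustrative assumptions, not values from the paper):

```python
import numpy as np

def to_discrete_labels(values, bin_edges, names):
    """Map continuous measurements to named buckets. `bin_edges` holds the
    interior edges between len(names) buckets, in ascending order."""
    idx = np.digitize(values, bin_edges)
    return [names[i] for i in idx]

# Hypothetical SNR bucketing in dB; real thresholds would come from the
# dataset statistics aggregated in this step.
snr_db = [12.0, 33.0, 48.0]
labels = to_discrete_labels(
    snr_db, bin_edges=[20.0, 40.0], names=["noisy", "moderate", "clean"]
)
# -> ['noisy', 'moderate', 'clean']
```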
Convert the sequence of discrete labels to a text description (using an LLM).
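In practice this means assembling the discrete labels into a prompt and asking an LLM to rewrite them as a fluent sentence. A sketch of the prompt-building step (the wording and attribute names are illustrative, not the pipeline's actual prompt):

```python
def build_prompt(labels: dict) -> str:
    """Assemble an LLM prompt asking for a one-sentence natural-language
    description of a recording's discrete annotation labels."""
    tags = ", ".join(f"{k}: {v}" for k, v in labels.items())
    return (
        "Describe the following speech recording in one fluent sentence. "
        f"Attributes: {tags}."
    )

prompt = build_prompt({
    "accent": "Irish",
    "pitch": "high",
    "speaking rate": "fast",
    "recording quality": "clean",
})
```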
Train a MusicGen-style model on the TTS task.
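A defining feature of MusicGen-style decoders is the codebook delay (interleaving) pattern: codebook k of the audio codec is shifted right by k steps, so all codebooks can be predicted autoregressively with a single decoder. The sketch below shows the pattern on toy token sequences; the padding value is an illustrative assumption:

```python
def apply_delay_pattern(codes, pad=-1):
    """MusicGen-style delay pattern: shift codebook k right by k steps,
    padding the gaps. `codes` is a list of per-codebook token lists, all
    the same length; each output row has length T + n_codebooks - 1."""
    n_q = len(codes)
    return [
        [pad] * k + row + [pad] * (n_q - 1 - k)
        for k, row in enumerate(codes)
    ]

# Two codebooks, three frames each:
delayed = apply_delay_pattern([[1, 2, 3], [4, 5, 6]])
# -> [[1, 2, 3, -1],
#     [-1, 4, 5, 6]]
```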