Arbitrary Modification of Speech Characteristics in Segmental Duration

Benjamin Harrison (bharrison49@gatech.edu)
Sidong Guo (sguo93@gatech.edu)

Description

This is an implementation of an algorithm that lets users select arbitrary non-overlapping time regions in any spoken audio and speed up or slow down each region by a scaling factor of their choosing, without altering other speech characteristics such as pitch or amplitude.

This implementation is based on ScalerGAN and Hi-Fi GAN.
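
The region model described above can be sketched in a few lines of Python. This is only an illustration, not code from TermProject.py: the names `validate_regions` and `scaled_duration` are hypothetical, and it assumes that a scaling factor multiplies a region's duration (so 0.5 halves the region's length and 2.0 doubles it). The actual convention used by the tool may be the inverse.

```python
from typing import List, Tuple

# (start_seconds, end_seconds, scale); the scale convention is an assumption:
# the region's new duration is (end - start) * scale.
Region = Tuple[float, float, float]


def validate_regions(regions: List[Region], total: float) -> List[Region]:
    """Return the regions sorted by start time, or raise ValueError if any
    region is out of bounds, has a non-positive scale, or overlaps another."""
    ordered = sorted(regions)
    prev_end = 0.0
    for start, end, scale in ordered:
        if not (0.0 <= start < end <= total):
            raise ValueError(f"region ({start}, {end}) is outside 0..{total}")
        if scale <= 0:
            raise ValueError(f"scale must be positive, got {scale}")
        if start < prev_end:
            raise ValueError(f"region starting at {start} overlaps the previous one")
        prev_end = end
    return ordered


def scaled_duration(regions: List[Region], total: float) -> float:
    """Duration of the output audio after every region is time-scaled.
    Audio outside the selected regions keeps its original length."""
    out = total
    for start, end, scale in validate_regions(regions, total):
        out += (end - start) * (scale - 1.0)
    return out
```

For a 10-second file, `scaled_duration([(1.0, 2.0, 0.5), (3.0, 5.0, 2.0)], 10.0)` gives 11.5 under this convention: the first region shrinks by 0.5 s and the second grows by 2 s.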

Steps

  1. Follow the setup instructions for ScalerGAN. In particular, download the LJ Speech Dataset and place it under scaler_gan/data/wavs.

  2. Follow the setup instructions for Hi-Fi GAN. Note that the default generator model is generator t2_v2; to use a different checkpoint, change the --checkpointfile argument in TermProject.py.

  3. The directory hierarchy should be:
    --scaler_gan
    ----hifi_gan
    ------generated_files_from_mel
    ------test_mel_files

  4. Place the audio files you want to time-scale under the directory scaler_gan/data/Project.

  5. From the scaler_gan (main) directory, run python TermProject.py

  6. A UI will appear that lets you select an audio file to time-scale and choose arbitrary segments, confirming each one with "commit changes".

  7. Once all audio segments have been chosen with their corresponding scaling factors, click "Complete Edits". A new wav file named "New_"+Originalwavfilename will be created with its audio regions time-scaled accordingly.
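
Before running step 5, it can help to confirm that the layout from steps 1 to 4 is in place. The following is a minimal sketch, not part of the project: `missing_dirs` and `output_name` are hypothetical helpers, and the directory list is simply read off the steps above. The source does not say where the output file is written, so only its name is derived here.

```python
from pathlib import Path
from typing import List

# Directories named in the steps above, relative to the scaler_gan directory.
EXPECTED_DIRS = [
    "data/wavs",                          # LJ Speech Dataset (step 1)
    "hifi_gan/generated_files_from_mel",  # step 3
    "hifi_gan/test_mel_files",            # step 3
    "data/Project",                       # audio files to time-scale (step 4)
]


def missing_dirs(root: str) -> List[str]:
    """Return the expected directories that do not exist under root."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]


def output_name(wav_path: str) -> str:
    """Name of the file described in step 7: "New_" plus the original file name."""
    return "New_" + Path(wav_path).name
```

Calling `missing_dirs(".")` from inside scaler_gan returns an empty list when everything is in place, and `output_name("data/Project/sample.wav")` returns `"New_sample.wav"` (the file name `sample.wav` is only an example).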