Benjamin Harrison (bharrison49@gatech.edu)
Sidong Guo (sguo93@gatech.edu)
This is an implementation of an algorithm that lets users select arbitrary non-overlapping regions of any spoken audio and speed each region up or down by a scaling factor of their choosing, without altering other speech characteristics such as pitch and amplitude.
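To illustrate just the region-splitting logic (not the actual GAN pipeline), here is a minimal sketch assuming regions are given as hypothetical (start_sample, end_sample, speed_factor) triples. Note that the naive resampling below also shifts pitch; preserving pitch while time-scaling is precisely what the ScalerGAN + Hi-Fi GAN pipeline provides.

```python
import numpy as np

def scale_regions(signal, regions):
    """Naively time-scale non-overlapping (start, end, factor) regions.

    Illustrative sketch only: plain linear resampling like this also
    shifts pitch, whereas the project's GAN pipeline does not.
    A factor > 1 speeds the region up (shortens it); < 1 slows it down.
    """
    out = []
    cursor = 0
    for start, end, factor in sorted(regions):
        out.append(signal[cursor:start])              # untouched audio
        seg = signal[start:end]
        new_len = max(1, int(round(len(seg) / factor)))
        idx = np.linspace(0, len(seg) - 1, new_len)   # resample positions
        out.append(np.interp(idx, np.arange(len(seg)), seg))
        cursor = end
    out.append(signal[cursor:])                       # trailing audio
    return np.concatenate(out)
```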
This implementation is based on ScalerGAN and Hi-Fi GAN.
- Follow the instructions in the ScalerGAN repository; as an exception, download the LJ Speech Dataset and place it under scaler_gan/data/wavs.
- Follow the instructions in the Hi-Fi GAN repository; as an exception, the default generator model is generator t2_v2 (the --checkpointfile argument can be changed in TermProject.py).
- The directory hierarchy should be:

    scaler_gan
      hifi_gan
        generated_files_from_mel
        test_mel_files
- Place the spoken content/audio files you want to time-scale under the directory scaler_gan/data/Project.
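The layout above can be prepared from the repository root with a few mkdir commands; this is a convenience sketch (adjust if your checkout already contains some of these directories):

```shell
# Create the expected directory layout (paths taken from this README).
mkdir -p scaler_gan/hifi_gan/generated_files_from_mel
mkdir -p scaler_gan/hifi_gan/test_mel_files
mkdir -p scaler_gan/data/wavs      # LJ Speech wav files go here
mkdir -p scaler_gan/data/Project   # audio files to time-scale go here
```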
- Under the scaler_gan (main) directory, run: python TermProject.py
- A UI will appear that lets you select an audio file to time-scale and choose arbitrary segments, confirming each with "commit changes".
- Once all audio segments have been chosen with their corresponding scaling factors, click "Complete Edits". A new wav file named "New_" + the original wav filename will be created, with its audio regions time-scaled accordingly.
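The output naming convention above can be sketched as a small helper; `output_name` is a hypothetical function written here only to show where the edited file lands (next to the input, with a "New_" prefix):

```python
from pathlib import Path

def output_name(wav_path):
    # Hypothetical helper: the edited file is written alongside the
    # input as "New_" + the original wav filename, per the README.
    p = Path(wav_path)
    return str(p.with_name("New_" + p.name))
```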