A transformer-based model for voice conversion that needs only a small amount of training data (about 1 minute of low-noise speech) and few training epochs, yet produces good results.
For the online experience, see the site here.
The model first processes the audio signal with a `UNet`, which isolates the vocal track. The vocal track is then analyzed by `PitchNet`, a specialized residual network designed for pitch extraction, to obtain pitch features. Simultaneously, `HuBERT` is used to capture detailed features of the vocal track. The model also introduces noise through an STFT transformation; this noise is combined with the extracted features and passed to `MixNet`. `MixNet`, based on a Transformer encoder-decoder architecture, is trained to generate a mask that extracts and replaces the timbre of the audio, ultimately producing the final output audio.
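The mask-based spectrogram manipulation described above can be sketched with a plain NumPy STFT. This is a minimal illustration of the idea only, not the code in this repository: the mask here is random, whereas the real mask is predicted by `MixNet`.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Naive STFT: frame the signal, apply a Hann window, FFT each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_fft // 2 + 1)

def istft(spec, n_fft=512, hop=128):
    """Inverse of the naive STFT above, using windowed overlap-add."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n_fft] += f * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 220 * t)  # 1 s of a 220 Hz tone as dummy "vocals"
spec = stft(audio)

# Hypothetical mask (random values in [0, 1]) standing in for MixNet's output.
mask = np.random.default_rng(0).uniform(0.0, 1.0, spec.shape)
converted = istft(spec * mask)       # masked spectrogram back to a waveform
```

With an all-ones mask the signal is reconstructed unchanged; the learned mask's job is to keep the linguistic content while swapping the timbre.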
The entire AI-powered process is implemented in `mix.py`, while the individual network architectures can be found in the `models` folder.
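As a conceptual illustration of the pitch-extraction step: `PitchNet` itself is a learned residual network, but a classical autocorrelation estimator (shown below as a stand-in, not the repository's code) computes the same kind of F0 feature.

```python
import numpy as np

def estimate_pitch(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate F0 of one frame via autocorrelation: the strongest
    autocorrelation peak inside the plausible lag range gives the period."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range for fmin..fmax
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(2048) / sr
f0 = estimate_pitch(np.sin(2 * np.pi * 220 * t), sr)  # close to 220 Hz
```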
The results of training on a 1-minute speech of Donald Trump are as follows:
| Train 10 epochs (Hozier's "Too Sweet") | Train 100 epochs (Hozier's "Too Sweet") |
| --- | --- |
| te10_20s.webm | te100_20s.webm |
You can experience creating your own voice online; see the site here.