A transformer-based model for voice conversion that needs only a small amount of training data (about 1 minute of low-noise speech) and few training epochs, yet produces good results.
For the online experience, see the site here.
The model first processes the audio signal with a `UNet`, which isolates the vocal track. The vocal track is then analyzed by `PitchNet`, a specialized residual network designed for pitch extraction, to obtain pitch features. Simultaneously, `HuBERT` is used to capture detailed features of the vocal track. The model also introduces noise through an STFT transformation; this noise is combined with the extracted features and passed to `MixNet`. `MixNet`, based on a Transformer encoder-decoder architecture, is trained to generate a mask that extracts and replaces the timbre of the audio, ultimately producing the final output audio.
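The mask-based spectrogram manipulation described above can be sketched with a plain NumPy STFT. This is a minimal illustration of the idea only, not the code in this repository: the mask here is random, whereas the real mask is predicted by `MixNet`.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Naive STFT: frame the signal, apply a Hann window, FFT each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_fft // 2 + 1)

def istft(spec, n_fft=512, hop=128):
    """Inverse of the naive STFT above, using windowed overlap-add."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n_fft] += f * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 220 * t)  # 1 s of a 220 Hz tone as dummy "vocals"
spec = stft(audio)

# Hypothetical mask (random values in [0, 1]) standing in for MixNet's output.
mask = np.random.default_rng(0).uniform(0.0, 1.0, spec.shape)
converted = istft(spec * mask)       # masked spectrogram back to a waveform
```

With an all-ones mask the signal is reconstructed unchanged; the learned mask's job is to keep the linguistic content while swapping the timbre.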
The entire AI-powered process is implemented in `mix.py`, while the individual network architectures can be found in the `models` folder.
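As a conceptual illustration of the pitch-extraction step: `PitchNet` itself is a learned residual network, but a classical autocorrelation estimator (shown below as a stand-in, not the repository's code) computes the same kind of F0 feature.

```python
import numpy as np

def estimate_pitch(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate F0 of one frame via autocorrelation: the strongest
    autocorrelation peak inside the plausible lag range gives the period."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range for fmin..fmax
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(2048) / sr
f0 = estimate_pitch(np.sin(2 * np.pi * 220 * t), sr)  # close to 220 Hz
```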
The results of training on a 1-minute speech of Donald Trump are as follows:
| Train 10 epochs (Hozier's "Too Sweet") | Train 100 epochs (Hozier's "Too Sweet") |
| --- | --- |
| te10_20s.webm | te100_20s.webm |
You can experience creating your own voice online; see the site here.