FastVC is a fast and efficient, non-parallel, any-to-any voice conversion (VC) tool. VC modifies the voice of a source speaker so that it sounds like that of a target speaker, without changing the linguistic content of the sentence. Our tool tackles the task by cascading an Automatic Speech Recognition (ASR) model and a Text-to-Speech (TTS) model.
The ASR is based on Wav2vec 2.0 and is used to transcribe the speech from a source speaker. The TTS is based on SV2TTS and is used to generate the output speech from a target speaker embedding.
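The cascade described above can be sketched as a composition of two stages. This is a minimal illustration, not the project's actual code: the names `convert_voice`, `asr` and `tts` are hypothetical, and the stages are passed in as plain callables so that the toy stand-ins below could be swapped for real Wav2vec 2.0 and SV2TTS wrappers.

```python
from typing import Callable, List, Sequence

Audio = Sequence[float]  # mono waveform samples (assumed representation)


def convert_voice(
    source_audio: Audio,
    target_reference: Audio,
    asr: Callable[[Audio], str],
    tts: Callable[[str, Audio], Audio],
) -> Audio:
    """ASR -> TTS cascade: keep the words, replace the voice."""
    text = asr(source_audio)            # only the linguistic content survives
    return tts(text, target_reference)  # re-rendered in the target's voice


# Toy stand-ins so the sketch runs without any model (purely illustrative).
def toy_asr(audio: Audio) -> str:
    return "hello world"


def toy_tts(text: str, reference: Audio) -> List[float]:
    # One output sample per character, scaled by the reference's first sample.
    gain = reference[0] if reference else 1.0
    return [gain * 0.5 for _ in text]


out = convert_voice([0.1, 0.2], [2.0], toy_asr, toy_tts)
```

Because the only thing passed from the first stage to the second is text, the source speaker's voice characteristics never reach the output, which is why no parallel recordings of the two speakers are needed.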
For a more detailed explanation check out the paper of our project. A demo page is available here.
The software was implemented using Python 3.9.4.
- Clone the repository and enter the directory:
    git clone https://github.com/fmiotello/fastVC.git
    cd fastVC
- (Optional) Create a virtual environment and activate it:
    python -m venv env
    source env/bin/activate (if using macOS/Linux)
    .\env\Scripts\activate (if using Windows)
- Upgrade pip:
    python -m pip install --upgrade pip
- Install dependencies:
    python -m pip install -r requirements.txt
- Download the pretrained models (encoder, synthesizer, vocoder) and put them in the correct directories:
    ./src/encoder/saved_models/pretrained.pt
    ./src/synthesizer/saved_models/pretrained/pretrained.pt
    ./src/vocoder/saved_models/pretrained/pretrained.pt
- Run the main script (use --help to display the available options):
    python src/main.py
  The output audio will be ./src/audio/audio_out.wav.
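Before running the main script, it can help to confirm that the three pretrained model files are where the steps above expect them. The following is a small sketch of such a check; the helper name `missing_models` is hypothetical and not part of the repository, and the paths are the ones listed in the installation steps.

```python
from pathlib import Path

# Relative paths as listed in the installation steps.
MODEL_PATHS = (
    "src/encoder/saved_models/pretrained.pt",
    "src/synthesizer/saved_models/pretrained/pretrained.pt",
    "src/vocoder/saved_models/pretrained/pretrained.pt",
)


def missing_models(repo_root="."):
    """Return the expected model paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in MODEL_PATHS if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_models()
    if missing:
        print("Missing pretrained models:")
        for path in missing:
            print("  " + path)
    else:
        print("All pretrained models found.")
```

Run it from the repository root, or pass the path of the cloned fastVC directory to `missing_models`.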
More instructions can be found here.
This application was developed as a project at Politecnico di Milano (MSc in Music and Acoustic Engineering).
Luigi Attorresi
Federico Miotello
Eugenio Poliuti
Marzieh Ali Atashi is a master's student at South Tehran University.
Student number: 40114140111030
Course: Digital Signal Processing
Professor: Dr. Mahde Eslami
https://github.com/fmiotello by Marzieh Ali Atashi
Summary by Marzieh Ali Atashi
The FastVC project is a fast and efficient, non-parallel, any-to-any voice conversion tool. It modifies the voice of a source speaker so that it sounds like the voice of a target speaker, without changing the linguistic content of the utterance, and produces the result as the output audio.
FastVC is a cascade model with three main parts: ASR, the transcription it produces, and TTS.
The project contains an automatic speech recognition model, a text-to-speech model, and an encoder that converts speech into a code (the speaker embedding). The ASR model transcribes the source speech into text, and the encoder extracts the target speaker embedding; the source code for both is in the project. Text-to-speech conversion in the voice of the target speaker is performed by the synthesizer and the vocoder, and the source code for their implementation is also available.
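The division of labour inside the TTS stage can be sketched as follows. In SV2TTS, the encoder maps a reference utterance of the target speaker to a fixed-size embedding, the synthesizer generates a mel spectrogram from text conditioned on that embedding, and the vocoder turns the spectrogram into a waveform. The function names and the toy components below are hypothetical placeholders, not the repository's API.

```python
from typing import Callable, List, Sequence

Waveform = Sequence[float]
Embedding = List[float]
Mel = List[List[float]]  # frames x mel bins (assumed layout)


def make_tts(
    encoder: Callable[[Waveform], Embedding],
    synthesizer: Callable[[str, Embedding], Mel],
    vocoder: Callable[[Mel], List[float]],
) -> Callable[[str, Waveform], List[float]]:
    """Compose encoder, synthesizer and vocoder into a single TTS callable."""

    def tts(text: str, target_reference: Waveform) -> List[float]:
        embedding = encoder(target_reference)  # who is speaking
        mel = synthesizer(text, embedding)     # what is said, in that voice
        return vocoder(mel)                    # spectrogram -> waveform

    return tts


# Toy components so the sketch runs without any model (purely illustrative).
def toy_encoder(reference: Waveform) -> Embedding:
    return [sum(reference) / len(reference)]  # "embedding" = mean sample


def toy_synthesizer(text: str, embedding: Embedding) -> Mel:
    return [[embedding[0]] for _ in text]     # one single-bin frame per character


def toy_vocoder(mel: Mel) -> List[float]:
    return [value for frame in mel for value in frame]  # flatten the frames


tts = make_tts(toy_encoder, toy_synthesizer, toy_vocoder)
samples = tts("abc", [1.0, 3.0])
```

Only the encoder sees the target speaker's audio, so a new target voice needs just a short reference utterance rather than retraining the synthesizer or vocoder.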
A notable point about the FastVC project is that this kind of cascade model serves as a base for many related projects, such as converting speech to other languages, converting speech to text, encoding, and voice encryption. The same basic cascade is reported to underlie projects implemented across 120 languages. The models in the cascade can be trained, and this training has worked well and efficiently when the approach is applied to other languages.
https://www.aparat.com/maissa0 This channel contains 37 videos covering the project steps and the related training in Colab.
https://drive.google.com/file/d/1YK-vMSDA0TSGzh8QP61V7RwHPr3qBFor/view?usp=share_link
https://drive.google.com/file/d/1Y_QiI4wbRLuFqbBmL9zW4Vl-pOqY1-BV/view?usp=share_link
https://drive.google.com/file/d/1FrUpP2ulNwG6WOHDppnLG5qzjsRiNI0m/view?usp=share_link
Final project presentation
https://drive.google.com/file/d/1NBFunokpV1ouKR0O8zT4Rv0SgE9U5IXR/view?usp=sharing
https://drive.google.com/file/d/1m1M8LR6Jx4DgGkzUfefTyowCKUzWA7uG/view?usp=sharing
https://drive.google.com/file/d/1cPZ301ICJCeF28NWO1YN3jxDUa_kCXrW/view?usp=sharing
https://drive.google.com/file/d/1JdcZFGb3qRrQAgcIErNyxTD1--6EpFpI/view?usp=sharing
https://drive.google.com/file/d/1RjPt6wzL48qBaCeOAzaSFWfDahb-ZOTR/view?usp=sharing