Welcome to My Voice Cloning and Realistic Video Generation project. This project was developed to create near-real-time digital twins using advanced voice cloning and video generation techniques. The solution combines cutting-edge AI technologies to produce lifelike digital replicas of individuals, complete with their voices, expressions, and speech.
The challenge is to develop AI models with the following capabilities:
- Advanced Neural Architectures: I leveraged state-of-the-art deep learning techniques, including recurrent neural networks (RNNs), convolutional neural networks (CNNs), and generative adversarial networks (GANs), for voice cloning and realistic video generation.
- Expressiveness: My goal was to create models that could accurately convey a wide range of emotions, accents, and speaking styles. This enables expressive voice cloning and natural video generation from 2D images.
- Naturalness: I focused on making the generated voice clones sound natural and human-like, and paid close attention to achieving precise lip-sync and realistic video corresponding to the cloned audio.
- Near-Real-Time Operation: I built an ensemble of voice cloning and video generation models designed to operate in near real-time, making the solution suitable for conversational AI applications.
My approach to solving this challenge involved two distinct components:
I utilized the Tortoise-TTS repository to implement both voice cloning and text-to-speech capabilities. This component allows users to upload audio samples for voice cloning and specify text prompts for generating cloned voices.
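For illustration, here is a minimal sketch of how the Tortoise-TTS Python API can be driven for this step. The file paths are placeholders for the user's uploaded samples, and the exact API may differ between versions of the repository:

```python
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

# Load a few short reference clips of the target speaker
# (placeholder paths standing in for the uploaded samples).
voice_samples = [load_audio(p, 22050) for p in ["sample1.wav", "sample2.wav"]]

tts = TextToSpeech()

# Generate speech in the cloned voice; the "fast" preset trades
# some quality for lower latency.
speech = tts.tts_with_preset(
    "Hello, this is my cloned voice.",
    voice_samples=voice_samples,
    preset="fast",
)

# Tortoise outputs audio as a torch tensor at 24 kHz.
torchaudio.save("cloned.wav", speech.squeeze(0).cpu(), 24000)
```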
For the generation of lifelike videos with precise lip-sync, I integrated the SadTalker repository. This component takes an input image, an audio file from the voice cloning step, and produces a video with seamless lip-sync.
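SadTalker is typically driven through its `inference.py` script. The sketch below shows one way the video component might wrap it; the file paths are illustrative, and the flags should be checked against the SadTalker README:

```python
import subprocess

# Invoke SadTalker's inference script with the cloned audio and the
# source image; --result_dir is where the rendered video is written.
subprocess.run(
    [
        "python", "inference.py",
        "--driven_audio", "cloned.wav",   # output of the voice-cloning step
        "--source_image", "person.png",   # uploaded 2D image
        "--result_dir", "./results",
    ],
    cwd="SadTalker",  # assumes the SadTalker repo has been cloned here
    check=True,
)
```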
To handle processing requirements efficiently and prevent crashes, I employed separate Google Colab instances for each component. Additionally, I configured ngrok with Flask to create user-friendly URLs for easy integration with a Streamlit application.
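As a sketch of that wiring (using the `pyngrok` package; the `/clone` endpoint name and payload fields are illustrative assumptions, not the project's actual contract), each Colab instance can expose its model behind a Flask route and publish the tunnel URL:

```python
from flask import Flask, request, send_file
from pyngrok import ngrok

app = Flask(__name__)

@app.route("/clone", methods=["POST"])
def clone():
    # Receive the reference audio and text prompt from the Streamlit app.
    request.files["audio"].save("sample.wav")
    text = request.form["text"]
    # ... run the voice-cloning model here and write cloned.wav ...
    return send_file("cloned.wav", mimetype="audio/wav")

# Open an ngrok tunnel to the local Flask port and print the public URL,
# which is then pasted into app.py.
public_url = ngrok.connect(5000)
print("Public URL:", public_url)
app.run(port=5000)
```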
Before you begin, ensure you have met the following requirements:
| Requirement | Version |
| ----------- | ------- |
| Python      | >= 3.6  |
| TensorFlow  | >= 2.0  |
| PyTorch     | >= 1.0  |
To install the required dependencies, follow these steps:
1. Clone the repository:

   ```bash
   git clone https://github.com/bruno-noir/voice-cloning-video-generation.git
   ```

2. Install the necessary packages:

   ```bash
   pip install -r requirements.txt
   ```
To run the entire system, follow these steps:
1. Upload the `TorTTS_API.ipynb` notebook to one Colab instance and the `Vid_API.ipynb` notebook to another Colab instance.
2. Configure the ngrok APIs in both instances.
3. Enter the ngrok URLs generated in step 2 into the `app.py` file.
4. Launch the Streamlit application:

   ```bash
   streamlit run app.py
   ```
In the Streamlit application, users can perform the following actions:
- Upload a sample audio file (.wav) with a duration of 10 to 15 seconds for voice cloning.
- Specify a text prompt for generating speech.
- Upload an image (.png) of the person to be animated.
- Watch the system generate a video with seamless lip-sync from the provided audio and image.
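A minimal sketch of what the `app.py` front end might look like; the tunnel URLs, endpoint paths, and field names are assumptions for illustration:

```python
import requests
import streamlit as st

# ngrok URLs published by the two Colab instances (placeholders).
VOICE_URL = "https://<tts-tunnel>.ngrok.io/clone"
VIDEO_URL = "https://<video-tunnel>.ngrok.io/generate"

st.title("Voice Cloning and Realistic Video Generation")

audio = st.file_uploader("Sample audio (10-15 s, .wav)", type="wav")
text = st.text_input("Text prompt")
image = st.file_uploader("Portrait image (.png)", type="png")

if st.button("Generate") and audio and text and image:
    # Step 1: clone the voice on the TTS instance.
    cloned = requests.post(VOICE_URL, files={"audio": audio}, data={"text": text})
    # Step 2: send the cloned audio plus the image to the video instance.
    video = requests.post(
        VIDEO_URL,
        files={"audio": ("cloned.wav", cloned.content), "image": image},
    )
    st.video(video.content)
```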
This project is licensed under the MIT License - see the LICENSE file for details.