VideoLingo is an all-in-one video translation, localization, and dubbing tool aimed at generating Netflix-quality subtitles. It eliminates stiff machine translations and multi-line subtitles, adds high-quality dubbing, and enables knowledge sharing across language barriers worldwide. Through an intuitive Streamlit web interface, you can go from a video link to embedded high-quality bilingual subtitles, and even dubbing, in just a few clicks, easily creating Netflix-quality localized videos.
Key features and functionalities:
- 🎥 Uses yt-dlp to download videos from YouTube links
- 🎙️ Uses WhisperX for word-level timeline subtitle recognition
- 📝 Uses NLP and GPT for subtitle segmentation based on sentence meaning
- 📚 GPT summarizes and extracts a terminology knowledge base for context-aware translation
- 🔄 Three-step direct translation, reflection, and paraphrasing, rivaling professional subtitle translation quality
- ✅ Checks single-line length according to Netflix standards, absolutely no double-line subtitles (see the sketch after this list)
- 🗣️ Uses methods like GPT-SoVITS for high-quality aligned dubbing
- 🚀 One-click integrated package launch, one-click video production in Streamlit
- 📝 Detailed logging of each operation step, supporting interruption and progress resumption at any time
- 🌐 Comprehensive multi-language support, easily achieving cross-language video localization
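The Netflix line-length check mentioned above is easy to picture with a minimal sketch. It assumes the commonly cited Netflix limit of 42 characters per line for Latin-script subtitles; this is not VideoLingo's actual code, and the project's real limit and function names may differ:

```python
# Minimal sketch of a per-line subtitle length check (not VideoLingo's actual implementation).
# 42 characters per line is the commonly cited Netflix limit for Latin-script subtitles.
MAX_LINE_LENGTH = 42  # assumed limit; VideoLingo's actual value may differ

def needs_resplit(line: str, max_len: int = MAX_LINE_LENGTH) -> bool:
    """Return True if a subtitle line exceeds the per-line limit and must be split again."""
    return len(line) > max_len

print(needs_resplit("Short enough to stay on one line."))  # False
print(needs_resplit("This sentence keeps going well past the forty-two character ceiling."))  # True
```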
Demo videos: `ru_demo.mp4` | `sovits.mp4` | `fishttsdemo.mp4` | `OAITTS.mp4`
Currently supported input languages and examples:
Input Language | Support Level | Translation Demo | Dubbing Demo |
---|---|---|---|
English | 🤩 | English to Chinese | TODO |
Russian | 😊 | Russian to Chinese | TODO |
French | 🤩 | French to Japanese | TODO |
German | 🤩 | German to Chinese | TODO |
Italian | 🤩 | Italian to Chinese | TODO |
Spanish | 🤩 | Spanish to Chinese | TODO |
Japanese | 😐 | Japanese to Chinese | TODO |
Chinese* | 🤩 | Chinese to English | Professor Luo Xiang's Talk Show |
*Chinese requires separate configuration of the whisperX model, see source code installation
Translation languages support all languages that the large language model can handle, while dubbing languages depend on the chosen TTS method.
- The integrated package uses the CPU version of torch, with a size of about 2.6G.
- When using UVR5 for voice separation in the dubbing step, the CPU version will be significantly slower than GPU-accelerated torch.
- The integrated package only supports calling whisperXapi ☁️ via API, and does not support running whisperX locally 💻.
- The whisperXapi used in the integrated package does not support Chinese transcription. If you need to use Chinese, please install from source code and use locally run whisperX 💻.
- The integrated package does not yet perform UVR5 voice separation in the transcription step, so it is not recommended to use videos with noisy BGM.
If you need the following features, please install from source code (requires an Nvidia GPU and at least 20G of disk space):
- Input language is Chinese
- Run whisperX locally 💻
- Use GPU-accelerated UVR5 for voice separation
- Transcribe videos with noisy BGM
- Download the v1.4 one-click package (800M): Download Directly | Baidu Backup
- After extracting, double-click `OneKeyStart.bat` in the folder
- In the opened browser window, configure the necessary settings in the sidebar, then create your video with one click!
💡 Note: This project requires configuration of large language models, WhisperX, and TTS. Please carefully read the API Preparation section below
This project requires the use of large language models, WhisperX, and TTS. Multiple options are provided for each component. Please read the configuration guide carefully 😊
Recommended Model | Recommended Provider | base_url | Price | Effect |
---|---|---|---|---|
claude-3-5-sonnet-20240620 (default) | Yunwu API | https://yunwu.zeabur.app | ¥15 / 1M tokens | 🤩 |
deepseek-coder | deepseek | https://api.deepseek.com | ¥2 / 1M tokens | 😲 |
Note: Yunwu API also supports OpenAI's tts-1 interface, which can be used in the dubbing step.
Reminder: deepseek has a very low probability of errors during translation. If errors occur, please switch to the claude 3.5 sonnet model.
Which model should I choose?
- 🌟 Claude 3.5 is used by default: excellent translation quality, very good coherence, and no AI flavor.
- 🚀 If using deepseek, translating a 1-hour video costs about ¥1, with average results.
How to get an API key?
- Click the link for the recommended provider above
- Register an account and recharge
- Create a new API key on the API key page
- For Yunwu API, make sure to check `Unlimited Quota`, select the `claude-3-5-sonnet-20240620` model, and it is recommended to choose the `Pure AZ 1.5x` channel.
Can I use other models?
- ✅ Supports OAI-Like API interfaces, but you need to change it yourself in the Streamlit sidebar.
⚠️ However, other models (especially small models) have weak ability to follow instructions and are very likely to report errors during translation, which is strongly discouraged.
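For context, an "OAI-Like" API is simply an OpenAI-compatible chat-completions endpoint. Below is a minimal sketch using the official `openai` Python client; the key, base URL (including whether a trailing `/v1` is needed), and model name are placeholders you would replace with your provider's values:

```python
from openai import OpenAI  # pip install openai

# Placeholder credentials -- substitute your provider's base_url and API key.
client = OpenAI(api_key="sk-...", base_url="https://yunwu.zeabur.app/v1")

resp = client.chat.completions.create(
    model="claude-3-5-sonnet-20240620",  # or any other model your provider exposes
    messages=[{"role": "user", "content": "Translate into Chinese: Knowledge has no borders."}],
)
print(resp.choices[0].message.content)
```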
VideoLingo uses WhisperX for speech recognition, supporting both local deployment and cloud API.
Option | Disadvantages |
---|---|
whisperX 🖥️ | • Install CUDA 🛠️ • Download model 📥 • High VRAM requirement 💾 |
whisperXapi ☁️ | • Requires VPN 🕵️♂️ • Visa card 💳 • Poor Chinese effect 🚫 |
- Register at Replicate, bind a Visa card payment method, and obtain the token
- Or join the QQ group to get a free test token from the group announcement
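As a rough sketch of the cloud route, the Replicate token can be used from Python with the `replicate` client. The model slug, version, and input field names below are placeholders; check the actual whisperX model page on Replicate for the exact values before relying on this:

```python
import replicate  # pip install replicate; the client reads REPLICATE_API_TOKEN from the environment

# Placeholder model identifier and input names -- consult the whisperX model page on Replicate.
output = replicate.run(
    "some-author/whisperx:VERSION_HASH",
    input={
        "audio_file": open("raw_audio.wav", "rb"),  # audio track extracted from the video
        "language": "en",
    },
)
print(output)  # transcription segments with word-level timestamps
```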
VideoLingo provides multiple TTS integration methods. Here's a comparison (skip this if you're only translating without dubbing):
TTS Option | Advantages | Disadvantages | Chinese Effect | Non-Chinese Effect |
---|---|---|---|---|
🎙️ OpenAI TTS | Realistic emotion | Chinese sounds like a foreigner | 😕 | 🤩 |
🔊 Azure TTS | Natural effect | Inconvenient recharge | 🤩 | 😃 |
🎤 Fish TTS (Recommended) | Excellent | Requires recharge | 😱 | 😱 |
🗣️ GPT-SoVITS (beta) | Local voice cloning | Currently only supports English input Chinese output, requires GPU for model inference, best for single-person videos without obvious BGM, and the base model should be close to the original voice | 😂 | 🚫 |
- For OpenAI TTS, we recommend using Yunwu API;
- Azure TTS free keys can be obtained in the QQ group announcement or you can register and recharge yourself on the official website;
- Fish TTS free keys can be obtained in the QQ group announcement or you can register and recharge yourself on the official website
How to choose an OpenAI voice?
You can find the voice list on the official website, such as `alloy`, `echo`, `nova`, and `fable`. Modify `OAI_VOICE` in `config.py` to change the voice.
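To preview a voice before committing, here is a minimal sketch using the OpenAI `tts-1` interface (which the Yunwu API is said to support, see above); the API key and base URL are placeholders:

```python
from openai import OpenAI  # pip install openai

client = OpenAI(api_key="sk-...", base_url="https://yunwu.zeabur.app/v1")  # placeholder values

resp = client.audio.speech.create(
    model="tts-1",
    voice="alloy",  # try "echo", "nova", "fable", ... from the official voice list
    input="This is a quick preview of the selected voice.",
)
with open("voice_preview.mp3", "wb") as f:
    f.write(resp.content)  # listen to this file before setting OAI_VOICE in config.py
```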
How to choose an Azure voice?
It is recommended to listen and choose the voice you want in the online demo, and find the corresponding code for that voice in the code panel on the right, such as `zh-CN-XiaoxiaoMultilingualNeural`.
How to choose a Fish TTS voice?
Go to the official website to listen and choose the voice you want, and find the corresponding code for that voice in the URL; for example, Ding Zhen's code is `54a5170264694bfc8e9ad98df7bd89c3`. Popular voices have already been added to `config.py`; just modify `FISH_TTS_CHARACTER`. If you need to use other voices, modify the `FISH_TTS_CHARACTER_ID_DICT` dictionary in `config.py`.
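For illustration, the relevant Fish TTS entries in `config.py` would look roughly like this; the variable names are the ones mentioned above, but the exact layout of the real file may differ:

```python
# Rough sketch of the Fish TTS settings in config.py -- the real file's layout may differ.
FISH_TTS_CHARACTER = "Ding Zhen"  # must match a key in the dictionary below

FISH_TTS_CHARACTER_ID_DICT = {
    "Ding Zhen": "54a5170264694bfc8e9ad98df7bd89c3",  # reference id taken from the voice's URL
    # "Your Voice Name": "reference-id-copied-from-the-fish-audio-url",
}
```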
GPT-SoVITS-v2 Usage Tutorial
- Go to the official Yuque document to check the configuration requirements and download the integrated package.
- Place `GPT-SoVITS-v2-xxx` in the same directory level as `VideoLingo`. Note that they should be parallel folders.
- Choose one of the following methods to configure the model:
a. Self-trained model:
- After training the model, `tts_infer.yaml` under `GPT-SoVITS-v2-xxx\GPT_SoVITS\configs` will automatically be filled with your model address. Copy and rename it to `your_preferred_english_character_name.yaml`
- In the same directory as the `yaml` file, place the reference audio you'll use later, named `your_preferred_english_character_name_text_content_of_reference_audio.wav` or `.mp3`, for example `Huanyuv2_Hello, this is a test audio.wav`
- In the sidebar of the VideoLingo webpage, set `GPT-SoVITS Character` to `your_preferred_english_character_name`.
b. Use pre-trained model:
- Download my model from here, extract and overwrite to `GPT-SoVITS-v2-xxx`.
- Set `GPT-SoVITS Character` to `Huanyuv2`.
c. Use other trained models:
- Place the `xxx.ckpt` model file in the `GPT_weights_v2` folder and the `xxx.pth` model file in the `SoVITS_weights_v2` folder.
- Refer to method a: rename the `tts_infer.yaml` file and modify `t2s_weights_path` and `vits_weights_path` in the `custom` section of the file to point to your models, for example:

  ```yaml
  # Example configuration for method b:
  t2s_weights_path: GPT_weights_v2/Huanyu_v2-e10.ckpt
  version: v2
  vits_weights_path: SoVITS_weights_v2/Huanyu_v2_e10_s150.pth
  ```
- Refer to method a: place the reference audio you'll use later in the same directory as the `yaml` file, named `your_preferred_english_character_name_text_content_of_reference_audio.wav` or `.mp3`, for example `Huanyuv2_Hello, this is a test audio.wav`. The program will automatically recognize and use it.
- ⚠️ Warning: Please use English to name the `character_name`, otherwise errors will occur. The `text_content_of_reference_audio` can be in Chinese. This feature is still in beta and may produce errors.
```
# Expected directory structure:
.
├── VideoLingo
│   └── ...
└── GPT-SoVITS-v2-xxx
    ├── GPT_SoVITS
    │   └── configs
    │       ├── tts_infer.yaml
    │       ├── your_preferred_english_character_name.yaml
    │       └── your_preferred_english_character_name_text_content_of_reference_audio.wav
    ├── GPT_weights_v2
    │   └── [Your GPT model file]
    └── SoVITS_weights_v2
        └── [Your SoVITS model file]
```
After configuration, make sure to select `Reference Audio Mode` in the webpage sidebar. VideoLingo will automatically open the inference API port of GPT-SoVITS in a pop-up command line during the dubbing step; you can close it manually after dubbing is complete. Note that this method is still not very stable and may result in missing words, missing sentences, or other bugs, so please use it with caution.
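If dubbing stalls, a quick way to see whether the GPT-SoVITS inference API actually started is to probe its local port. This is only a sketch; it assumes the commonly used default port 9880, so verify against the port shown in the pop-up command line:

```python
import requests  # pip install requests

PORT = 9880  # assumed default GPT-SoVITS inference port -- verify against the pop-up command line

try:
    requests.get(f"http://127.0.0.1:{PORT}", timeout=3)
    print("GPT-SoVITS inference API is reachable.")
except requests.exceptions.ConnectionError:
    print("API not reachable yet -- wait for the GPT-SoVITS window to finish loading.")
```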
Before starting the installation of VideoLingo, please ensure you have 20G of free disk space and complete the following steps:
Dependency | whisperX 🖥️ | whisperX ☁️ |
---|---|---|
Anaconda 🐍 | Download | Download |
Git 🌿 | Download | Download |
Cuda Toolkit 12.6 🚀 | Download | - |
Cudnn 9.3.0 🧠 | Download | - |
Note: When installing Anaconda, check "Add to system Path", and restart your computer after installation 🔄
Some Python knowledge is required. Supports Win, Mac, Linux. If you encounter any issues, you can ask GPT about the entire process~
- Open Anaconda Prompt and switch to the desktop directory:

  ```bash
  cd desktop
  ```

- Clone the project and switch to the project directory:

  ```bash
  git clone https://github.com/Huanshere/VideoLingo.git
  cd VideoLingo
  ```

- Create and activate the virtual environment (must be Python 3.10.0):

  ```bash
  conda create -n videolingo python=3.10.0 -y
  conda activate videolingo
  ```

- Run the installation script:

  ```bash
  python install.py
  ```

  Follow the prompts to select the desired Whisper method; the script will automatically install the corresponding torch and whisper versions.

- Only for users who need Chinese transcription: please manually download the Belle-whisper-large-v3-zh-punct model (Baidu link) and overwrite it into the `_model_cache` folder in the project root directory.

- 🎉 Enter the command below or click `OneKeyStart.bat` to launch the Streamlit application:

  ```bash
  streamlit run st.py
  ```

- Set the key in the sidebar of the pop-up webpage, and be sure to select the whisper method.

- (Optional) More advanced settings can be manually modified in `config.py`.
- UVR5 has high memory requirements: 16GB RAM can process up to 30 minutes of video, and 32GB RAM up to 50 minutes. Please be cautious with long videos.
- There is a very small chance of 'phrase' errors occurring in the translation step. If encountered, please report them.
- The dubbing quality is unstable. For best results, try to choose a TTS speed suited to the original video; for example, OpenAI TTS speech is relatively fast, while for Fish TTS, please listen to samples before choosing.
This project is licensed under the Apache 2.0 License. When using this project, please follow these rules:
- When publishing works, it is recommended (not mandatory) to credit VideoLingo for subtitle generation.
- Follow the terms of the large language models and TTS used for proper attribution.
- If you copy the code, please include the full copy of the Apache 2.0 License.
We sincerely thank the following open-source projects for their contributions, which provided important support for the development of VideoLingo:
- Join our QQ Group: 875297969
- Submit Issues or Pull Requests on GitHub
- Follow me on Twitter: @Huanshere
- Visit the official website: videolingo.io
If you find VideoLingo helpful, please give us a ⭐️!