fotcorn/e2e-tts

End-to-End Text To Speech (Webpage to .wav file)

Python

End-to-end Text to Speech

End-to-end processing engine from website text to speech audio output.

Features:

Extract main text from websites using trafilatura
Preprocess text with NVIDIA NeMO and some custom code
Split text into sentences, taking maximum number of tokens into account
Generate speech with Coqui XTTS-v2
Validate speech samples with whisper-timestamped, regenerating sample if necessary
Concat sentences into one WAV
Enhance WAV with noisereduce

Setup

Install the following dependencies: pip install noisereduce requests beautifulsoup4 trafilatura nemo_toolkit[all] whisper-timestamped TTS

License

Due to the dependency on whisper-timestamped, the whole project is licensed under APGL-3.0.