🗣️ Open TTS Tracker

A one-stop shop for tracking all open-access/open-source TTS models as they come out. Feel free to open a PR for any that aren't listed here.

This list aims to raise awareness of these models and to make it easier for researchers, developers, and enthusiasts to stay informed about the latest advancements in the field.

Note

This repo only tracks TTS models with an open-source/open-access codebase. More motivation for everyone to open-source! 🤗

| Name | GitHub | Weights | License | Fine-tune | Languages | Paper | Demo | Issues |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Amphion | Repo | 🤗 Hub | MIT | No | Multilingual | Paper | 🤗 Space | |
| AI4Bharat | Repo | 🤗 Hub | MIT | Yes | Indic | Paper | Demo | |
| Bark | Repo | 🤗 Hub | MIT | No | Multilingual | Paper | 🤗 Space | |
| EmotiVoice | Repo | GDrive | Apache 2.0 | Yes | ZH + EN | Not Available | Not Available | Separate GUI agreement |
| Glow-TTS | Repo | GDrive | MIT | Yes | English | Paper | GH Pages | |
| GPT-SoVITS | Repo | 🤗 Hub | MIT | Yes | Multilingual | Not Available | Not Available | |
| HierSpeech++ | Repo | GDrive | MIT | No | KR + EN | Paper | 🤗 Space | |
| IMS-Toucan | Repo | GH release | Apache 2.0 | Yes | Multilingual | Paper | 🤗 Space | |
| MahaTTS | Repo | 🤗 Hub | Apache 2.0 | No | English + Indic | Not Available | Recordings, Colab | |
| Matcha-TTS | Repo | GDrive | MIT | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
| MetaVoice-1B | Repo | 🤗 Hub | Apache 2.0 | Yes | Multilingual | Not Available | 🤗 Space | |
| Neural-HMM TTS | Repo | GitHub | MIT | Yes | English | Paper | GH Pages | |
| OpenVoice | Repo | 🤗 Hub | CC-BY-NC 4.0 | No | ZH + EN | Paper | 🤗 Space | Non Commercial |
| OverFlow TTS | Repo | GitHub | MIT | Yes | English | Paper | GH Pages | |
| Parler TTS | Repo | 🤗 Hub | Apache 2.0 | Yes | English | Not Available | Not Available | |
| pflowTTS | Unofficial Repo | GDrive | MIT | Yes | English | Paper | Not Available | GPL-licensed phonemizer |
| Piper | Repo | 🤗 Hub | MIT | Yes | Multilingual | Not Available | Not Available | GPL-licensed phonemizer |
| Pheme | Repo | 🤗 Hub | CC-BY | Yes | English | Paper | 🤗 Space | |
| RAD-MMM | Repo | GDrive | MIT | Yes | Multilingual | Paper | Jupyter Notebook, Webpage | |
| RAD-TTS | Repo | GDrive | MIT | Yes | English | Paper | GH Pages | |
| Silero | Repo | GH links | CC BY-NC-SA | No | EM + DE + ES + EA | Not Available | Not Available | Non Commercial |
| StyleTTS 2 | Repo | 🤗 Hub | MIT | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
| Tacotron 2 | Unofficial Repo | GDrive | BSD-3 | Yes | English | Paper | Webpage | |
| TorToiSe TTS | Repo | 🤗 Hub | Apache 2.0 | Yes | English | Technical report | 🤗 Space | |
| TTTS | Repo | 🤗 Hub | MPL 2.0 | No | ZH | Not Available | Colab, 🤗 Space | |
| VALL-E | Unofficial Repo | Not Available | MIT | Yes | NA | Paper | Not Available | |
| VITS/MMS-TTS | Repo | 🤗 Hub / MMS | Apache 2.0 | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
| WhisperSpeech | Repo | 🤗 Hub | MIT | No | English, Polish | Not Available | 🤗 Space, Recordings, Colab | |
| XTTS | Repo | 🤗 Hub | CPML | Yes | Multilingual | Paper | 🤗 Space | Non Commercial |
| xVASynth | Repo | 🤗 Hub | GPL-3.0 | Yes | Multilingual | Paper | 🤗 Space | Copyrighted materials used for training. |

Capability specifics

Name Processor ⚡ Phonetic alphabet 🔤 Insta-clone 👥 Emotional control 🎭 Prompting 📖 Speech control 🎚 Streaming support 🌊 S2S support 🦜 Longform synthesis
Amphion CUDA 👥 🎭👥 ❌
Bark CUDA ❌ 🎭 tags ❌
EmotiVoice
Glow-TTS
GPT-SoVITS
HierSpeech++ ❌ 👥 🎭👥 ❌ speed / stability 🎚 🦜
IMS-Toucan CUDA ❌ ❌ ❌ ❌
MahaTTS
Matcha-TTS IPA ❌ ❌ ❌ speed / stability 🎚
MetaVoice-1B CUDA 👥 🎭👥 ❌ stability / similarity 🎚 Yes
Neural-HMM TTS
OpenVoice CUDA ❌ 👥 6-type 🎭 😡😃😭😯🤫😊 ❌
OverFlow TTS
pflowTTS
Piper
Pheme CUDA ❌ 👥 🎭👥 ❌ stability 🎚
RAD-TTS
Silero
StyleTTS 2 CPU / CUDA IPA 👥 🎭👥 ❌ 🌊 Yes
Tacotron 2
TorToiSe TTS ❌ ❌ ❌ 📖 🌊
TTTS CPU/CUDA ❌ 👥
VALL-E
VITS/MMS-TTS CUDA ❌ ❌ ❌ ❌ speed 🎚
WhisperSpeech CUDA ❌ 👥 🎭👥 ❌ speed 🎚
XTTS CUDA ❌ 👥 🎭👥 ❌ speed / stability 🎚 🌊 ❌
xVASynth CPU / CUDA ARPAbet+ ❌ 4-type 🎭 😡😃😭😯 per-phoneme ❌ speed / pitch / energy / 🎭 🎚 per-phoneme ❌ 🦜
  • Processor - CPU/CUDA/ROCm (single/multi used for inference; the real-time factor should be below 2.0 to qualify for CPU, though some leeway can be given if the model supports audio streaming)
  • Phonetic alphabet - None/IPA/ARPAbet (Phonetic transcription that lets you control the pronunciation of certain words during inference)
  • Insta-clone - Yes/No (Zero-shot model for quick voice cloning)
  • Emotional control - Yes 🎭/Strict (Strict means the model cannot go in between states; insta-clone switch/🎭👥)
  • Prompting - Yes/No (A side effect of narrator-based datasets and a way to affect the emotional state; see the ElevenLabs docs)
  • Streaming support - Yes/No (Whether audio can be played back while it is still being generated)
  • Speech control - speed/pitch/ (Ability to change the pitch, duration, energy and/or emotion of generated speech)
  • Speech-To-Speech support - Yes/No (Streaming support implies real-time S2S; S2T=>T2S does not count)
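
The Processor criterion above depends on the real-time factor (RTF). This README does not define RTF, so the sketch below assumes the common convention: synthesis time divided by the duration of the generated audio. The `synthesize` callable and the helper names are hypothetical stand-ins for illustration, not the API of any model listed here.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def qualifies_for_cpu(rtf: float, threshold: float = 2.0) -> bool:
    """Apply the 'below 2.0' rule from the legend (streaming leeway not modelled)."""
    return rtf < threshold


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical: text -> audio samples
    text: str,
    sample_rate: int,
) -> float:
    """Time one synthesis call and relate it to the length of its output."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)
```

For example, a model that needs 3 seconds on CPU to produce 2 seconds of audio has an RTF of 1.5 and would qualify under this reading, while one that needs 4 seconds for the same clip (RTF 2.0) would not.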

How can you help?

Help make this list more complete. Create demos on the Hugging Face Hub and link them here :) Got any questions? Drop me a DM on Twitter @reach_vb.