Vaibhavs10/open-tts-tracker

Even more capability columns

Pendrokar opened this issue · 10 comments

Suggest adding more columns that would describe capabilities. Please comment on which of these you see as notable enough.

  1. GPU acceleration - Yes/No (CUDA/ROCm single/multi)
  2. Word pronunciation adjustment - None/IPA/ARPAbet/<other>
  3. Insta-clone - Yes/No (quick voice clone using a few audio samples, though already implied with TTS that do not have fine-tuning)
  4. Emotional control - Yes/Strict/No (Strict meaning it cannot go in between states)
  5. Prompting* - Yes/No (Often a side effect of narrator-based datasets and a way to affect the emotional state)
  6. Streaming support - Yes/No (Is it possible to play back audio that is still being generated)
  7. Audio control - Yes/No (speed/<other>) (Ability to change the pitch, duration, energy and/or emotion of generated speech)
  8. Per-phoneme control - Yes/No (speed/<other>) (Ability to change the pitch, duration, energy and/or emotion of each uttered phoneme)
  9. Speech-To-Speech support - Yes/No (S2S capability lately seems to often come alongside TTS)

*Prompting as mentioned in ElevenLabs docs:
https://elevenlabs.io/docs/speech-synthesis/prompting

This no doubt hurts the viewability of the table. Yes/No values may not work, as the header row cannot be frozen in place.

Here is what that would look like with xVASynth's row filled in, perhaps also leaving the No cells empty:
https://github.com/Pendrokar/open-tts-tracker/blob/main/README.md

Use Shift+Scrollwheel 🖱 on that Table. I used the following Markdown Table Editor:
https://www.tablesgenerator.com/markdown_tables#

I think the best way to handle this is to make a github page and add toggles to it. Barring that, maybe emojis could be used to represent categorical stuff with less space.
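A rough sketch of the emoji idea, in case it helps. The emoji choices, the `compact_row` helper, and the `ExampleTTS` row are all made up for illustration and are not taken from the tracker. Values without a mapping (e.g. `IPA`) pass through unchanged, and `No` becomes an empty cell.

```python
# Hypothetical mapping from categorical values to compact symbols.
EMOJI = {"Yes": "✅", "Strict": "🔒", "No": ""}


def compact_row(name, capabilities):
    """Render one Markdown table row, replacing known values with emojis."""
    cells = [EMOJI.get(value, value) for value in capabilities]
    return "| " + " | ".join([name] + cells) + " |"


# Placeholder model name and capability values, not real tracker data.
print(compact_row("ExampleTTS", ["Yes", "No", "Strict", "IPA"]))
```

A legend above the table would still be needed so readers know what each symbol stands for.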

Also cannot really add #2 or other performance related stuff as it depends on hardware.

> I think the best way to handle this is to make a github page and add toggles to it. Barring that, maybe emojis could be used to represent categorical stuff with less space.

Plain near CSS-less GitHub page:
https://pendrokar.github.io/open-tts-tracker/

Hi @Pendrokar - I love it. Can we also look into populating the rows for the other model checkpoints?

cc: @fakerybakery - what do you think about this? I think another table in the README would be much better from a viewability perspective.

Hi,
A second table, or maybe something interactive online sounds like a great idea.
Have you considered making it a Gradio demo like the Open LLM Leaderboard? Then you can choose which columns you want using the checkboxes.

> Also cannot really add #2 or other performance related stuff as it depends on hardware.

Most, if not all, TTS models can run inference on CPU, though some authors would not recommend it. That means CPU would appear in every cell of the Processor column. So we would have to pick some arbitrary "real-time factor" threshold for a TTS to qualify. IMHO anything below a factor of 2.0 would be bearable; 1.0 would be near real-time if the TTS supports streaming.
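To make the threshold concrete, here is a minimal sketch. It assumes the usual definition of real-time factor as synthesis time divided by the duration of the generated audio. The function names and the sample timings are hypothetical, and the 2.0 cut-off is just the arbitrary number suggested above.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def qualifies(rtf: float, threshold: float = 2.0) -> bool:
    """True if the RTF is strictly below the (arbitrary) cut-off."""
    return rtf < threshold


# Hypothetical measurement: 10 s of audio generated in 15 s on a CPU.
rtf = real_time_factor(15.0, 10.0)
print(rtf, qualifies(rtf))
```

Any such number still depends on the machine it was measured on, so the hardware would have to be stated next to it.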

> Hi, A second table, or maybe something interactive online sounds like a great idea. Have you considered making it a Gradio demo like the Open LLM Leaderboard? Then you can choose which columns you want using the checkboxes.

To show as a DataFrame component? Yeah I tried that. But I was only able to run it on a Space with hardware rather than a static template. Gradio Lite would work, but the start-up time seems too long to me. Saving the final webpage also did not seem to work.

The PR for including the table has been accepted. Feel free to make your own PRs that change information in the rows.