erew123/alltalk_tts

Feature Request V2 Beta: Ability to change the configuration file used in script.py

Closed this issue · 8 comments

Is your feature request related to a problem? Please describe.
I cannot run multiple instances of AllTalk with independent configurations without creating copies of script.py and start_alltalk.bat and hand-editing them. The config file location is hard-coded to always be this_dir / "confignew.json" when loading and saving.

Describe the solution you'd like
I'd like to keep the default behavior to use the existing config file unless an environment variable is set such as ALLTALK_CONFIG

Additional context
I am working around another issue with AllTalk: it doesn't support multiple simultaneous requests to the same backend. One way to work around this limitation, other than having AllTalk support it (I think that, because it is Python + PyTorch/CUDA, AllTalk would have to spawn multiple processes), is to run multiple copies of AllTalk instead. I am writing a wrapper in Go, because I know how to do this better in Go than in Python, to bootstrap multiple AllTalk runtimes. However, I need to create all the batch files by hand and make copies of script.py that point to different configurations, so that I can run the AllTalk backend API on a different port for each copy. So the ability to set the config file used by script.py via an environment variable seems like the easiest way to simplify that workaround for now.

Hopefully it is as simple as changing the config_file_path like so:

config_env_var = os.getenv('ALLTALK_CONFIG')
config_file_path = Path(config_env_var) if config_env_var else this_dir / "confignew.json"

Then, in def save_config_data(), change it from using the path join to the config_file_path variable above? Maybe it needs to be wrapped in a Path with absolute() called on it, and maybe there are other hard-coded places in script.py. I am just not familiar enough with the code to debug it and provide an update. Just changing the load seems to work for me for now, which is why I don't have a full PR for you.
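
A minimal sketch of the idea above, assuming the load and save code in script.py can share one resolved path. The helper names resolve_config_path, load_config, and save_config are hypothetical and are not AllTalk's actual functions; only the ALLTALK_CONFIG variable name and the confignew.json default come from this request.

```python
import json
import os
from pathlib import Path

# Stand-in for script.py's own this_dir (assumption: script.py defines it elsewhere).
this_dir = Path.cwd()


def resolve_config_path(default_dir: Path = this_dir) -> Path:
    """Return the path from ALLTALK_CONFIG if set, else default_dir/confignew.json."""
    env_value = os.getenv("ALLTALK_CONFIG")
    if env_value:
        # expanduser/resolve so "~" and relative paths behave predictably.
        return Path(env_value).expanduser().resolve()
    return default_dir / "confignew.json"


def load_config(path: Path) -> dict:
    """Read the JSON config from the given path."""
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def save_config(path: Path, data: dict) -> None:
    """Write the config back to the same path it was loaded from."""
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)
```

Resolving the path once and passing it to both load and save keeps the two from drifting apart. As noted later in this thread, other files such as tts_server.py also reference the config, so this alone would not cover every place.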

Perhaps you would like to have a look at AllTalk MEM. It's a work in progress at the moment.

image

I've uploaded it to the beta. To start it, you would need to start the Python environment and run python ts_mem.py.

Not sure if that may get you part way to what you want.

It currently has no command line arguments and does not store any settings in a separate file. However, it may prove useful for testing and get you some way to what you want.

Changing the config path is possible, but it is a complicated beast to attack because so many parts of the code reference that file.

Yes, wanted to come back and confirm that I also need to modify tts_server.py to get this working.

Hmm, this isn't a legitimate feature request to be able to specify another config file? I understand it might take a while but there are use cases other than mine to maybe want to change the file it points to easily.

I'll check out ts_mem.py to see what that's about! Thanks for the tidbit.

I added an update yesterday to tts_server.py to handle the changes needed.

What specifically would you want to be changing between config files? And what is the use case?

{
    "branding": "AllTalk ",
    "delete_output_wavs": "Disabled",
    "gradio_interface": true,
    "output_folder": "outputs",
    "gradio_port_number": 7852,
    "firstrun_model": false,
    "firstrun_splash": true,
    "launch_gradio": true,
    "transcode_audio_format": "Disabled",
    "theme": {
        "file": null,
        "class": "gradio/base"
    },
    "rvc_settings": {
        "rvc_enabled": false,
        "rvc_char_model_file": "Disabled",
        "rvc_narr_model_file": "Disabled",
        "split_audio": true,
        "autotune": false,
        "pitch": 0,
        "filter_radius": 3,
        "index_rate": 0.75,
        "rms_mix_rate": 1,
        "protect": 0.5,
        "hop_length": 130,
        "f0method": "fcpe",
        "embedder_model": "hubert",
        "training_data_size": 45000
    },
    "tgwui": {
        "tgwui_activate_tts": true,
        "tgwui_autoplay_tts": true,
        "tgwui_narrator_enabled": "false",
        "tgwui_non_quoted_text_is": "character",
        "tgwui_deepspeed_enabled": false,
        "tgwui_language": "English",
        "tgwui_lowvram_enabled": false,
        "tgwui_pitch_set": 0,
        "tgwui_temperature_set": 0.75,
        "tgwui_repetitionpenalty_set": 10,
        "tgwui_generationspeed_set": 1,
        "tgwui_narrator_voice": "female_01.wav",
        "tgwui_show_text": true,
        "tgwui_character_voice": "female_01.wav",
        "tgwui_rvc_char_voice": "Disabled",
        "tgwui_rvc_narr_voice": "Disabled"
    },
    "api_def": {
        "api_port_number": 7851,
        "api_allowed_filter": "[^a-zA-Z0-9\\s.,;:!?\\-\\'\"$\\u0400-\\u04FF\\u00C0-\\u017F\\u0150\\u0151\\u0170\\u0171\\u011E\\u011F\\u0130\\u0131\\u0900-\\u097F\\u2018\\u2019\\u201C\\u201D\\u3001\\u3002\\u3040-\\u309F\\u30A0-\\u30FF\\u4E00-\\u9FFF\\u3400-\\u4DBF\\uF900-\\uFAFF\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF\\uAC00-\\uD7A3\\u1100-\\u11FF\\u3130-\\u318F\\uFF01\\uFF0c\\uFF1A\\uFF1B\\uFF1F]",
        "api_length_stripping": 3,
        "api_max_characters": 2000,
        "api_use_legacy_api": false,
        "api_legacy_ip_address": "127.0.0.1",
        "api_text_filtering": "standard",
        "api_narrator_enabled": "false",
        "api_text_not_inside": "character",
        "api_language": "en",
        "api_output_file_name": "myoutputfile",
        "api_output_file_timestamp": true,
        "api_autoplay": false,
        "api_autoplay_volume": 0.5
    },
    "debugging": {
        "debug_transcode": false,
        "debug_tts": false,
        "debug_openai": false,
        "debug_concat": false,
        "debug_tts_variables": false,
        "debug_rvc": false
    }
}

I updated MEM today, to add some additional level of control:

image

And I have some in-progress work on a centralised queue, i.e. one single point to send TTS requests to, that load balances requests between running TTS engines.

image

Hey, cool, great work so far.

The configuration I am currently modifying (not that I will necessarily need it all in the long run) is the API port and various RVC settings, depending on the model I need for the speaker and whether RVC is enabled. I don't see it (?), but eventually I will want to select and configure settings depending on whether it is using Piper, VITS, or XTTS.
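
As a rough illustration of that kind of per-instance variation, here is a sketch that writes one config file per instance by applying overrides to a base confignew.json. The keys used in the usage note (api_def.api_port_number, rvc_settings.rvc_enabled, rvc_settings.pitch) appear in the config posted above; the helper names and the confignew_N.json naming scheme are hypothetical.

```python
import copy
import json
from pathlib import Path


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a copy of base with overrides applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def write_instance_configs(base_path: Path, out_dir: Path, per_instance: list[dict]) -> list[Path]:
    """Write one config file per override set and return the written paths."""
    base = json.loads(base_path.read_text(encoding="utf-8"))
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, overrides in enumerate(per_instance):
        path = out_dir / f"confignew_{index}.json"  # hypothetical naming scheme
        merged = deep_merge(base, overrides)
        path.write_text(json.dumps(merged, indent=4), encoding="utf-8")
        paths.append(path)
    return paths
```

For example, passing [{"api_def": {"api_port_number": 7851}}, {"api_def": {"api_port_number": 7861}, "rvc_settings": {"rvc_enabled": True, "pitch": 3}}] would produce two files that differ only in those keys. If the Gradio interface is launched per instance, gradio_port_number would presumably need to differ as well.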

To give you some more context, I am writing a multiplexer, of sorts, to enable streaming among different engine instances, spawn more based on memory, and partition the engines based on the 'Speaker' selected, which may be using a different model with or without the rvc pipeline enabled. This is to minimize churn reloading models and such, as certain engines would be dedicated with certain presets. The client will queue up various speaker/text combos and the plexer would send a stream of sound data back for each request as it becomes available directly to the client.

It sounds like your centralized queue may solve much of that use case depending on the factors it uses to balance across the engines. A lot of what you are doing might remove much of what I am managing and currently writing on my side. However, my idea might be going beyond the vision of All Talk's supported use cases.

If I think about what I think you are doing (mem<->queue<->client), the equivalent would be needing the queue to balance and match based on a certain identifier of speaker coming in from the client to pick a subset of engines configured correctly for that speaker, to minimize loading/unloading of the models and the latency caused by it.
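
A small sketch of that speaker-pinned routing idea, written in Python for consistency with the rest of the thread (the multiplexer described here is being written in Go). The SpeakerRouter class and the engine URLs are hypothetical; this is not part of AllTalk or MEM.

```python
import itertools


class SpeakerRouter:
    """Route each request to an engine from the pool pinned to its speaker."""

    def __init__(self, pools: dict[str, list[str]], default_pool: list[str]):
        if not default_pool:
            raise ValueError("default_pool must contain at least one engine")
        # One round-robin iterator per speaker so load spreads within a pool.
        self._cycles = {
            speaker: itertools.cycle(urls) for speaker, urls in pools.items() if urls
        }
        self._default = itertools.cycle(default_pool)

    def pick(self, speaker: str) -> str:
        """Return the next engine URL for this speaker, else one from the default pool."""
        return next(self._cycles.get(speaker, self._default))
```

A request for a speaker with a dedicated pool only ever reaches engines already configured for that speaker, so nothing needs reloading between requests; unknown speakers fall back to the default pool.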

I don't mind writing the specialized solution for my use case. I think what you are planning should work for a majority of users, though, to speed up results in a more real-time system with a single speaker/single model use case.

In the near future I actually want to dig more into what you have and see if I can create a Python script for each standalone engine that I can just spawn with CLI parameters for all the settings and that has a really bare-bones API, maybe even just a gRPC streaming server where it is text in -> audio out. This would be more ideal for what I have currently built. I am happy to contribute any Python script that can do that when I get that far. I am writing my multiplexer frontend in Go and plan to open source it when it is working well.

Again, nice work on all this and I appreciate your time spent on this discussion.

So I have a few things that may/may not be of use to you.


MEM TTS Queue status

First off, the TTS API queue is working and configurable, at least to the extent that you can now send a TTS generation request to AllTalk MEM and it will multiplex between whatever instances are started.

So what I am saying is, you can send a:

curl -X POST "http://127.0.0.1:7851/api/tts-generate" -d "text_input=All of this is text spoken by the character. This is text not inside quotes, though that doesnt matter in the slightest" -d "text_filtering=standard" -d "character_voice_gen=female_01.wav" -d "narrator_enabled=false" -d "narrator_voice_gen=male_01.wav" -d "text_not_inside=character" -d "language=en" -d "output_file_name=myoutputfile" -d "output_file_timestamp=true" -d "autoplay=true" -d "autoplay_volume=0.8"

and it will multiplex those requests between any instances that are loaded. Effectively this is a relay service, so if you send a bad request, it will get relayed on to a TTS engine and the response will get relayed back. And here is my brief explanation as to how it's working:

Q: Is there an in-built queue system to handle different requests to different loaded engines?
A: MEM incorporates a built-in queue system to manage multiple TTS requests across loaded engine instances:

  • All TTS requests are received through the API port (default: 7401).
  • The queue system distributes incoming requests among available TTS engine instances.
  • If all engines are busy, new requests are held in a queue until an engine becomes available.
  • The system continuously checks for available engines to process waiting requests.
  • If a request cannot be processed within the allocated time, it will be marked as failed.
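
The behaviour in the list above can be sketched roughly as follows. This is a toy illustration, not MEM's actual implementation: the EngineDispatcher class and the send callback are hypothetical, and it assumes the "allocated time" covers only the wait for a free engine.

```python
import queue
from typing import Callable


class EngineDispatcher:
    """Toy dispatcher: hand each request to an idle engine, or fail after a timeout."""

    def __init__(self, engines: list[str], send: Callable[[str, str], str], timeout: float):
        # Idle engines wait in a thread-safe queue until a request claims one.
        self._idle: "queue.Queue[str]" = queue.Queue()
        for engine in engines:
            self._idle.put(engine)
        self._send = send        # send(engine, text) -> result, supplied by the caller
        self._timeout = timeout  # seconds a request may wait for a free engine

    def submit(self, text: str) -> dict:
        """Process one request; safe to call from several threads at once."""
        try:
            # Blocks while every engine is busy: this is the "held in a queue" step.
            engine = self._idle.get(timeout=self._timeout)
        except queue.Empty:
            return {"status": "failed", "reason": "no engine became available in time"}
        try:
            return {"status": "ok", "engine": engine, "result": self._send(engine, text)}
        finally:
            self._idle.put(engine)  # mark the engine as available again
```

Calling submit from several threads at once (for example via concurrent.futures.ThreadPoolExecutor) spreads the requests across whichever engines are idle, and callers that wait longer than the timeout get a failed result instead of blocking forever.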

image

My intent would be to have all standard AllTalk API calls relayed through this, so you can query available voices etc. But this is not done yet; only TTS requests are relayed. Likewise, I have no idea whether this will handle streaming TTS or not.

I also made a simple load-testing tool to see how the queue handles simultaneous requests:

python mem_load_test.py --requests [number_of_requests] --length [text_length] --url "http://127.0.0.1:7501/api/tts-generate"


RVC Calls

I forgot to document:

rvccharacter_pitch
The pitch for the RVC voice for the character. Should be in the range -24 to 24, with 0 being the central point. If the pitch is not specified in the TTS request, the global setting from the Gradio RVC page is used.
-d "rvccharacter_pitch=3"

rvcnarrator_pitch
The pitch for the RVC voice for the narrator. Should be in the range -24 to 24, with 0 being the central point. If the pitch is not specified in the TTS request, the global setting from the Gradio RVC page is used.
-d "rvcnarrator_pitch=3"

So these are now listed in the documentation. These are part of the standard API TTS generation request calls. e.g.

curl -X POST "http://127.0.0.1:7851/api/tts-generate" -d "text_input=All of this is text spoken by the character. This is text not inside quotes, though that doesnt matter in the slightest" -d "text_filtering=standard" -d "character_voice_gen=female_01.wav" -d "narrator_enabled=false" -d "narrator_voice_gen=male_01.wav" -d "rvcnarrator_voice_gen=folder\voice2.pth" -d "rvccharacter_pitch=3" -d "text_not_inside=character" -d "language=en" -d "output_file_name=myoutputfile" -d "output_file_timestamp=true" -d "autoplay=true" -d "autoplay_volume=0.8"

And to cover off the API calls from the above CURL example and how the pipeline works, there is this from the system\config folder:

AllTalk API Process

So the RVC voice is handled in the red section of the above, as long as RVC is globally enabled (Gradio > Global Settings > RVC Settings > Enable RVC). That setting is global to AllTalk, irrespective of what TTS engine you have loaded.

As long as RVC has been enabled (which also checks for and downloads the base RVC models when you enable it in the Gradio interface), you can control the RVC voice, and also the pitch of the voice, on each TTS generation request. All the other RVC settings are set in the settings page, with no API calls currently available to change them.

So maybe rvccharacter_pitch covers most of what you need.
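
For anyone calling the API from Python rather than curl, here is a sketch of building the same form-encoded request with a per-request rvccharacter_pitch. The endpoint and field names are taken from the curl examples above; the build_tts_request helper and its argument names are hypothetical, and the range check simply mirrors the documented -24 to 24 range.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_tts_request(
    base_url: str,
    text: str,
    character_voice: str,
    rvc_character_pitch: Optional[int] = None,
) -> Request:
    """Build a POST to /api/tts-generate; omit the pitch to use the global setting."""
    if rvc_character_pitch is not None and not -24 <= rvc_character_pitch <= 24:
        raise ValueError("rvccharacter_pitch should be between -24 and 24")
    fields = {
        "text_input": text,
        "text_filtering": "standard",
        "character_voice_gen": character_voice,
        "narrator_enabled": "false",
        "text_not_inside": "character",
        "language": "en",
        "output_file_name": "myoutputfile",
        "output_file_timestamp": "true",
        "autoplay": "false",
    }
    if rvc_character_pitch is not None:
        fields["rvccharacter_pitch"] = str(rvc_character_pitch)
    # Form-encoded body, the same shape that curl -d produces.
    body = urlencode(fields).encode("utf-8")
    return Request(f"{base_url}/api/tts-generate", data=body, method="POST")
```

The returned object can be passed to urllib.request.urlopen to actually send it, for example build_tts_request("http://127.0.0.1:7851", "Hello there", "female_01.wav", rvc_character_pitch=3).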


Individual Engine Settings

The individual TTS engine settings are handled in their own JSON files in alltalk_tts\system\tts_engines\{engine-name}

These settings would be global to all instances that MEM starts.


Answering other questions

  1. spawn more based on memory
    Very tough one to handle. Especially if you are dealing with engines that are CUDA based, as CUDA can be funny about how it reports its memory management. Or at least, that is my current understanding of it; I've not researched too far. MEM does allow you to limit the maximum number of TTS engines you can start (or even run up as many engines as you want, into the 1000s). There is a possibility I may do something in MEM to spin up additional engines (up to the maximum allowed by the settings) if the queue length gets too long.

  2. and partition the engines based on the 'Speaker' selected
    At the moment, MEM only allows loading one type of TTS engine, which is whatever you set as the default engine in AllTalk's Gradio interface. Obviously the speaker is handled in the API request via character_voice_gen. I can't speak for all TTS engines that may exist in future, but Piper has to load the voice model on each generation request, and XTTS has to load the audio sample. Neither holds the audio sample (or, in Piper's case, the voice model) in memory, as I understand it, so partitioning may well not matter.

As mentioned though, I am currently only intending to make MEM spin up one kind of TTS engine, and respond as if it were just a standard AllTalk TTS server.

  3. which may be using a different model with or without the rvc pipeline enabled.
    As mentioned, the RVC pipeline is global and not TTS engine specific. It's pulled in/started IF the request is made by the API call (as shown above). There is no memory residency from enabling RVC globally, or after it has been used in an API request. This is just the way RVC's scripts currently work.

  4. This is to minimize churn reloading models and such, as certain engines would be dedicated with certain presets. The client will queue up various speaker/text combos and the plexer would send a stream of sound data back for each request as it becomes available directly to the client.
    As mentioned, Piper will always have to load the model in on each request. That is just the way the Piper code works. With XTTS and VITS, the models will remain loaded. With VITS, the speakers are built into the model, which adds a whole new layer of complexity to how multiple loaded VITS engines would be handled (I've not even thought about that yet, but it would be messy). With XTTS, the main model remains loaded, but the audio sample has to be provided on each TTS request. Those samples are only 100-400 KB each, so it's not too much impact in the scheme of things.

All those portions of the scripts are outside of my control and managed/handled by the individual TTS manufacturer (see Manufacturer Website/TTS Engine Support in each engine information page within AllTalk's Gradio interface).

  5. the equivalent would be needing the queue to balance and match based on a certain identifier of speaker coming in from the client to pick a subset of engines configured correctly for that speaker
    As mentioned in 4, I don't think I will go to that level of control. But if you do want to do it specifically, you can do it within your own queue system. I just thought I would state my own goals.

So that's where I am at with things.

My time to touch/work on AllTalk is currently very limited, due to long term ill health of a family member and travelling between my place of residence and 100+ miles to their place of residence, with longer stays at their residence and no access to a system where I can code. As such any progress with AllTalk will be spotty at best for the ??? future. Though I may be taking a pop at MEM today and adding a few extra bits.

Wow, thanks for such a detailed response! Sorry to hear about the health issues you're having to take care of, and I hope it works out well. No worries, I appreciate you and the discussions you have taken time for.

It makes sense that you'll only support a single model for all engines. The use case I have there is atypical for sure, and I am unsure whether it will even be useful in the long run.

Fair point that most of the engine models need to do some kind of reload/pretrain step first. I think the main part of my use case, though, is that I will need to partition/pin requests on the selected voice for RVC to save the time of that reload, given the inference time most of the TTS engines already have; the RVC part is actually pretty fast if the voice model is already loaded.

I understand that doesn't align well with your current goals outlined, and that is a very reasonable stance. I really appreciate all the insight, thanks for the discussion!