KoljaB/AIVoiceChat

Wake up word trigger

Opened this issue · 6 comments

Hi!
One of my goals is to be able to trigger an animation based on a keyword or smiley included in the LLM answer.
That way I could, for instance, animate expressive faces in OBS via websockets, with the script being triggered by the smiley (it's ignored by ElevenLabs).
Not sure if this is the right project - it might fit better as a module for Linguflex, but is that possible?

Should be quite easy to do in both projects. The easy way would be filtering the incoming tokens for emoticons and then triggering the animation. A better, more reliable way would probably be to use a structured output library like instructor and force the LLM to fill out a pydantic field with the desired expression.
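Roughly, the first option could look something like this (just a minimal sketch; `llm_token_stream` and `trigger_animation` are placeholders for however you receive the tokens and drive OBS):

```python
# Minimal sketch: scan the streamed LLM text for emoticons and fire a callback.
# EMOTICON_MAP, llm_token_stream and trigger_animation are placeholders.
EMOTICON_MAP = {
    "😀": "happy",
    "😢": "sad",
    "😠": "angry",
}

def trigger_animation(emotion: str) -> None:
    # Placeholder: send a websocket message to OBS, set an LED color, etc.
    print(f"trigger animation: {emotion}")

def filter_emoticons(llm_token_stream):
    """Yield cleaned tokens for TTS while triggering animations for emoticons."""
    for token in llm_token_stream:
        for emoticon, emotion in EMOTICON_MAP.items():
            if emoticon in token:
                trigger_animation(emotion)
                token = token.replace(emoticon, "")
        if token:
            yield token  # pass the emoticon-free text on to the TTS engine
```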

Ok, it might be above my skills :) I couldn't come up with pseudo-code logic that makes sure the expressive face scene is triggered while the corresponding speech is playing. Just to be sure I explained myself correctly (re-reading my question, I'm not sure I did):
What I want to do is ask the AI to provide an answer that includes emotion cues, so that we see the face change while the answer is spoken.
The difficulty I see is that the cues must trigger the face change at the moment the corresponding audio is played.
My first idea was to split the answer into chunks, name the audio files according to the emotion, and have the code read the filename when each file is played and send a trigger to OBS - but maybe there is an easier way.
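Something like this is what I had in mind (just a rough sketch; I'm assuming the chunk files are named like 003_happy.wav and using the obsws-python client for OBS WebSocket 5 - scene names and credentials are made up):

```python
# Rough sketch of the filename idea: derive the emotion from each chunk's
# filename and switch the OBS scene right before that chunk is played.
# Filenames, scene names and credentials are made-up examples.
from pathlib import Path

import obsws_python as obs  # pip install obsws-python (OBS WebSocket v5)

SCENES = {"happy": "Face_Happy", "sad": "Face_Sad", "neutral": "Face_Neutral"}

obs_client = obs.ReqClient(host="localhost", port=4455, password="secret")

def play_chunk(path: str) -> None:
    # Assumed naming convention: <index>_<emotion>.wav, e.g. 003_happy.wav
    emotion = Path(path).stem.split("_")[-1]
    obs_client.set_current_program_scene(SCENES.get(emotion, "Face_Neutral"))
    # ... then play the audio file with whatever player you use ...
```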

What I want to do is ask the AI to provide an answer that includes emotion cues, so that we see the face change while the answer is spoken.
This is what structured output libraries are meant for. Instead of trying to filter the emotion cues out of one single big LLM response, a library like instructor can split the LLM answer up into multiple parts. It could send a sentence together with an emotion cue for that sentence, so for every sentence from the LLM you would get the matching expression. You could also restrict the LLM to only respond with certain emotion cues that you define beforehand. I think this would be the gold standard for realizing your idea.
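As a rough sketch (assuming instructor 1.x on top of an OpenAI-compatible client; the model name and the emotion set are just examples), it could look like this:

```python
# Sketch: force the LLM to answer as sentence/emotion pairs, restricted to a
# predefined set of emotions. Model name and emotions are example values.
from enum import Enum
from typing import List

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Emotion(str, Enum):
    HAPPY = "happy"
    SAD = "sad"
    ANGRY = "angry"
    NEUTRAL = "neutral"

class SentencePart(BaseModel):
    text: str         # one sentence of the spoken answer
    emotion: Emotion  # the expression to show while this sentence is spoken

class Answer(BaseModel):
    parts: List[SentencePart]

client = instructor.from_openai(OpenAI())

answer = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    response_model=Answer,
    messages=[{"role": "user", "content": "Tell me a short story."}],
)

for part in answer.parts:
    print(part.emotion.value, "->", part.text)
```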

The difficulty I see is that the cues must trigger the face change at the moment the corresponding audio is played.
This would be the way that requires analyzing the "standard output" of the LLM. I don't see this being very reliable. It's the classic "we beg the LLM to include certain stuff in the output without being sure that it does" approach, which also involves parsing the output. Doable, but not the state-of-the-art way to achieve what you want.

Thank you for taking the time to answer. So as I understand it, the first option would be the gold standard, but I'm afraid it would add latency if it's one instruction at a time? Or would it be one big answer containing multiple sentences along with their emotions?
Anyway, it's aiming a bit too high for my skills yet, but it'd be cool to have this expressive module one day - one that triggers whatever the user wants (could be an LED color, an eye expression, a face change...).

With instructor you can make the LLM send a list of sentence/emotion pairs and stream everything back token by token, so only minimal latency is added. I've been thinking about an upgrade to my LocalAIVoiceChat project, where I plan to do this with a different voice reference for every emotion.
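A rough sketch of the streaming variant (reusing the SentencePart / Emotion models from the sketch above; I'm assuming instructor's iterable streaming via Iterable[...] with stream=True, and speak / trigger_animation are placeholders for your TTS and OBS hooks):

```python
# Sketch: stream sentence/emotion pairs one by one, so speech and the matching
# expression can start before the whole answer has been generated.
from typing import Iterable

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    response_model=Iterable[SentencePart],
    stream=True,
    messages=[{"role": "user", "content": "Tell me a short story."}],
)

for part in stream:  # each item arrives as a completed SentencePart
    trigger_animation(part.emotion.value)  # e.g. switch the OBS face scene
    speak(part.text)                       # e.g. hand the sentence to the TTS engine
```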

Look here (watch the little clip)