epic: Introducing naturalness for interruption
Closed this issue · 4 comments
Goal
Currently, if you interrupt ichigo, it will just start a new turn (at best).
Description
Based on the empirical results from the paper below, we can turn the LLM into a non-turn-based model. This means that if you interrupt it, it will continue the previous response and work your new input into it, instead of starting a new turn.
Resources
https://arxiv.org/abs/2406.15718
References
There have been a number of duplex models released recently, with two main approaches to duplex:
Approach 1: Model-level duplex (1 model with 2 input streams)
Moshi: https://arxiv.org/abs/2410.00037
Hertz-dev: https://github.com/Standard-Intelligence/hertz-dev
This approach requires specialized datasets, which have been scarce so far. However, the Duplex-UltraChat dataset could solve this issue. The model will be costly and difficult to train, but once it is trained, implementation should be easier.
Approach 2: System-level duplex (2 models with 2 input streams)
VITA: https://arxiv.org/pdf/2408.05211
System-level duplex dedicates separate models to listening and generating. The listening model acts as an advanced VAD + turn-taking predictor. An earlier concept by Google for an ASR-only listening model also exists (https://arxiv.org/pdf/2208.13321), which can take acoustic cues into account. However, with an LLM at the wheel, the linguistic context can be very useful for making better predictions.
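For intuition, here is a minimal sketch of what a system-level control loop could look like, assuming a hypothetical `listener` model that emits turn-taking decisions and a separate `speaker` model that generates replies. All names here are illustrative placeholders, not existing APIs:

```python
# Hypothetical system-level duplex loop: a small "listener" model decides
# when the user has finished (or wants to barge in), and a separate
# "speaker" model generates the reply. Names are illustrative only.
from dataclasses import dataclass

@dataclass
class TurnDecision:
    end_of_turn: bool    # listener thinks the user has finished speaking
    interruption: bool   # user started speaking while the TTS is playing

def duplex_loop(listener, speaker, tts, mic_chunks):
    """Route short audio chunks through the listener and gate the speaker."""
    context = []
    for chunk in mic_chunks:                    # continuous audio stream
        context.append(chunk)
        decision: TurnDecision = listener.predict(context)
        if decision.interruption and tts.is_playing():
            tts.stop()                          # barge-in: cut the current response
        if decision.end_of_turn:
            reply = speaker.generate(context)   # LLM sees the full linguistic context
            tts.play(reply)
            context.clear()
```

The point of splitting the system this way is that the interruption logic lives in explicit control code rather than inside a single model's generations.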
Evaluation
Approach 2 allows us to train multiple smaller specialized models and use explicit logic to control the overall system, which might produce a more explainable system than Approach 1, where a single model makes all the decisions.
Another underappreciated downside of Approach 1 models is that the model needs to constantly predict silence tokens for the entire time a user is speaking, whereas for Approach 2 the listening model only needs to predict a single EOS token. Additionally, the extra silence-token predictions can add "phantom" latency to the response.
That said, all these downsides could very well disappear with sufficient quality data and enough training. Duplex-UltraChat is a text-only dataset, so we would be training a model to make turn-taking predictions based only on linguistic cues.
Proposal
With the T2S approach of ichigo, we can try training on Duplex-UltraChat with minimal changes. If it works, the implementation changes we would need to make are as follows (a rough sketch follows the list):
- Once the mic has been turned on, send WhisperVQ tokens to ichigo 2 seconds at a time.
- Based on what ichigo generates, intercept any "idle" token tags to prevent them from being sent to the TTS.
- When ichigo generates a new non-idle token while the TTS is still speaking, stop the current TTS output and start playing the new response.
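A minimal sketch of that client-side loop, assuming hypothetical `encode_whisper_vq`, `ichigo`, and `tts` helpers (none of these are existing APIs), with the `<|idle|>` tag name also just a placeholder:

```python
# Hypothetical client-side loop for the proposal above. The helper objects
# (mic, encode_whisper_vq, ichigo, tts) and the "<|idle|>" tag are placeholders.
CHUNK_SECONDS = 2
IDLE_TAG = "<|idle|>"

def stream_conversation(mic, encode_whisper_vq, ichigo, tts):
    while mic.is_on():
        audio = mic.read(seconds=CHUNK_SECONDS)    # 2 seconds at a time
        sound_tokens = encode_whisper_vq(audio)    # audio -> WhisperVQ tokens
        for token in ichigo.stream(sound_tokens):
            if token == IDLE_TAG:
                continue                           # never forward idle tokens to the TTS
            if tts.is_playing():
                tts.stop()                         # model has something new to say: barge in
            tts.feed(token)                        # play the updated response
```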
This dataset could be useful as a test set, or could be added to the training data: use the ASR transcripts during training, but use the speech during testing.
https://magichub.com/datasets/multi-stream-spontaneous-conversation-training-datasets_english/
I want @bachvudinh and @tuanlda78202 to pair on this one.
cc @dan-homebrew since I'm assigning Charles now
Closing for now. We are using VAD as a simpler alternative in the interim. We will work on this later on.
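For reference, a minimal sketch of the interim VAD-based barge-in, using webrtcvad as one possible detector; the `tts` handle and the thresholds below are assumptions, not part of the current implementation:

```python
# Interim VAD-based interruption: if the user keeps speaking while the TTS
# is playing, cut the TTS and start a new turn. webrtcvad is one possible
# detector; the tts object and thresholds below are assumptions.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                      # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit mono PCM
SPEECH_FRAMES_TO_INTERRUPT = 10                    # ~300 ms of sustained speech

def vad_barge_in(mic, tts):
    vad = webrtcvad.Vad(2)                         # aggressiveness 0 (lenient) to 3 (strict)
    speech_run = 0
    while mic.is_on():
        frame = mic.read(FRAME_BYTES)
        speech_run = speech_run + 1 if vad.is_speech(frame, SAMPLE_RATE) else 0
        if speech_run >= SPEECH_FRAMES_TO_INTERRUPT and tts.is_playing():
            tts.stop()                             # user barged in: stop playback, start a new turn
            speech_run = 0
```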