janhq/ichigo

epic: Introducing naturalness for interruption


Goal

Currently, if you interrupt ichigo, it will (at best) just start a new turn.

Description

Based on the empirical results from this paper, we can turn the LLM into a non-turn-based model. That means if you interrupt it, it will continue the previous response while incorporating your new input, instead of starting a new turn.

Resources

https://arxiv.org/abs/2406.15718

References

https://arxiv.org/abs/2406.15718

A number of duplex models have been released recently, following two approaches to duplexing:

Approach 1: Model-level duplex (1 model with 2 input streams)
Moshi: https://arxiv.org/abs/2410.00037
Hertz-dev: https://github.com/Standard-Intelligence/hertz-dev

This approach requires specialized datasets, which have been scarce so far. However, the Duplex-UltraChat dataset could solve this issue. Training will be costly and difficult, but once the model is trained, implementation should be easier.
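
To make the "1 model with 2 input streams" idea concrete, here is a minimal sketch (not Moshi's or hertz-dev's actual API): at every audio frame the single model consumes one user-stream token and emits one assistant-stream token. `duplex_model.step` and the silence token are hypothetical placeholders.

```python
def model_level_duplex(duplex_model, user_frame_tokens):
    """One model, two streams: read a user token and emit an assistant token per frame."""
    assistant_tokens = []
    for user_tok in user_frame_tokens:
        # The model sees both streams and decides, frame by frame,
        # whether to keep emitting silence or to start/continue speaking.
        assistant_tokens.append(duplex_model.step(user_tok))
    return assistant_tokens
```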

Approach 2: System-level duplex (2 models with 2 input streams)
VITA: https://arxiv.org/pdf/2408.05211

System-level duplex dedicates separate models to listening and generating. The listening model acts as an advanced VAD + turn-taking predictor. An earlier Google concept for an ASR-only listening model also exists (https://arxiv.org/pdf/2208.13321), which can take acoustic cues into account. However, with an LLM at the wheel, the linguistic context can be very useful for making better predictions.
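
A rough sketch of the system-level layout (names are hypothetical, not VITA's API): a small listener model watches the incoming audio tokens and only decides when the turn ends, while a separate generator model produces the actual reply.

```python
def system_level_duplex(listener, generator, audio_token_stream):
    """Listener = advanced VAD + turn-taking predictor; generator = response LLM."""
    buffered = []
    for chunk in audio_token_stream:               # streaming audio tokens
        buffered.extend(chunk)
        # The listener can use both acoustic and linguistic context here.
        if listener.predicts_end_of_turn(buffered):
            yield generator.generate(buffered)     # the big model only runs at turn ends
            buffered = []
```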

Evaluation

Approach 2 allows us to train multiple smaller, specialized models and use explicit logic to control the overall system, which might produce a more explainable system than Approach 1, where the model makes all the decisions.
Another underappreciated downside of Approach 1 models is that the model needs to constantly predict silence tokens for the entire time the user is speaking, whereas in Approach 2 the listening model only needs to predict a single EOS token. These extra silence-token predictions can also lead to "phantom" latency in the response.
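
A back-of-the-envelope illustration of that overhead, assuming a 12.5 Hz audio-token frame rate (roughly Moshi's reported rate; treat the number as an assumption, not a measurement):

```python
frame_rate_hz = 12.5
user_speech_seconds = 10.0

# Approach 1: the duplex model decodes one silence token per frame while the user talks.
silence_decode_steps = int(frame_rate_hz * user_speech_seconds)  # 125 decode steps

# Approach 2: the listening model only has to emit a single end-of-turn decision.
eos_decode_steps = 1

print(silence_decode_steps, eos_decode_steps)  # 125 vs 1
```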

That said, all of these downsides could very well disappear with sufficiently high-quality data and enough training. Duplex-UltraChat is a text-only dataset, so we would be training a model to make turn-taking predictions based only on linguistic cues.

Proposal

With ichigo's T2S approach, we can try training on Duplex-UltraChat with minimal changes. If it works, the implementation changes we would need to make are as follows (a rough sketch of the loop follows the list):

  1. Once the mic has been turned on, send WhisperVQ tokens to ichigo 2 seconds at a time.
  2. Based on what ichigo generates, intercept any "idle" token tags to prevent them from being sent to the TTS.
  3. When ichigo generates a new non-idle token while the TTS is still speaking, stop the current TTS output and start playing the new response.
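
Minimal sketch of that loop. `mic`, `whisper_vq_encode`, `ichigo.generate_stream`, `tts`, and the `<|idle|>` tag are hypothetical placeholders for whatever the real ichigo pipeline exposes.

```python
CHUNK_SECONDS = 2          # step 1: send WhisperVQ tokens 2 seconds at a time
IDLE_TAG = "<|idle|>"      # step 2: tag ichigo emits when it has nothing new to say

def run_duplex_loop(mic, whisper_vq_encode, ichigo, tts):
    while mic.is_on():
        audio = mic.read(seconds=CHUNK_SECONDS)
        tokens = whisper_vq_encode(audio)
        for out_tok in ichigo.generate_stream(tokens):
            if out_tok == IDLE_TAG:
                continue              # step 2: never forward idle tokens to the TTS
            if tts.is_speaking():
                tts.stop()            # step 3: barge in on the current response
            tts.speak(out_tok)        # play the new (non-idle) response
```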

This dataset could be useful as a test set, or could be added to the training data: use the ASR transcripts during training, but use the speech during testing.

https://magichub.com/datasets/multi-stream-spontaneous-conversation-training-datasets_english/

I want @bachvudinh and @tuanlda78202 to pair on this one.

cc @dan-homebrew since I'm assigning Charles now

Closing for now. We are using VAD as a simpler alternative and will work on this later.