serenity-rs/songbird

Possible current_timestamp desynchronization in `PlayoutBuffer`

Closed this issue · 3 comments

Songbird version: git serenity-next

Rust version (rustc -V): rustc 1.82.0-nightly (176e54520 2024-08-04)

Serenity/Twilight version: git next

Description:
Scripty has been having issues with VoiceTick returning empty audio ever since a few weeks before DAVE introduction (possibly related, not worth digging into yet I don't think). Digging through logs, I noticed lots of lines noting Witholding packet: ts_diff is -xxxxxx or Packet arrived beyond playout max length: wanted slot nn. I was going to open a bug report long ago but I finally got around to it.

Explore-logs-2024-10-29 05_34_55.txt

Attached is a log file from the most recent known event of a desync. The first line where audio seems to disappear is Witholding packet: ts_diff is -403200. The timestamp difference between that and the last audio packet from this speaker approximately nine seconds before is 412800, 9600 (or 10*MONO_FRAME_SIZE) off of the actual ts_diff value. Based off this very basic poking and prodding I suspect a possible race condition somewhere, but don't know enough about songbird internals to poke into it much.

There's been lots of comments from me in the serenity discord about this so I'll copy paste some of them here for a full thread: whatever follows is copied+pasted

there's at least three possible failure states i know of right now

  • VoiceTick is always silent for every user no matter what
  • VoiceTick returns audio for a handful of users for an arbitrary length of time and then goes silent forever for all
  • VoiceTick always returns data for some users but never others

and as far as i know the failure state varies from session to session
during all of this the WebSocket is unaffected and i still get speaker updates normally no matter what which triggers empty results getting sent to scripty's STT service which return unusually fast (that's the thing that originally tipped me off here)

I'm willing to put a $50 USD bounty on this issue, through GitHub Sponsors or kofi. Let me know fi any additional info or anything else is required to help debug :)

Was hoping this would be fixed somewhat with the first implementation of Discord's new encryption modes but it doesn't appear to be so unfortunately.

Since originally opening this issue, I have heard from some users who can usually get it working consistently if they always join after the bot (ie they never get hit by this issue)

I'm not sure about a race condition, but I'm not really able to reproduce this locally over a various voice servers (Amsterdam, US West, Russia).

Broadly, what I see locally is:

  • User joins after bot.
    • No issues, no surprising RTP behaviour. No long (1s) catchup period.
  • Bot joins after user.
    • Bot receives approx 50 packets (1 second) of audio with identical timestamp as user switches over from DAVE to bot-compatible mode. This is what #249 was intended to fix.
    • I tried this a fair few times and I always seemed to have correct playout (all packets find correct slot in PlayoutBuffer, decode to valid audio in sync with my speech -- no obvious gaps etc.). Keeping the mic hot for >30s got me nothing here.

So, there's something different between my setup and yours that you're getting a different pattern of sequence numbers for one or more users. I just don't yet know what it is.