mas-bandwidth/yojimbo

reliable.io implicitly increments packetSequence during yojimbo's `SendPackets` - triggering an assert, effectively discarding reliable packets?

sfla opened this issue · 5 comments

sfla commented

By calling the Client's SendPackets on every "tick" (even though we have nothing to send on either the reliable or the unreliable channel), the following code still gets called:

yojimbo.cpp:3258, `SendPackets()`:

```cpp
uint16_t packetSequence = reliable_endpoint_next_packet_sequence( GetEndpoint() );
```
As far as I can tell, this integer is supposed to indicate the next available sequence number for our outbound packet. However, since we have no packets to send, it should not be used. Yet yojimbo seems to think there are packets to send.
Immediately after fetching this packetSequence, the following happens:

```cpp
if ( GetConnection().GeneratePacket( GetContext(), packetSequence, packetData, m_config.maxPacketSize, packetBytes ) )
{
    reliable_endpoint_send_packet( GetEndpoint(), packetData, packetBytes );
}
```

GeneratePacket (returning bool) is called and will return true in all but an allocation edge case, even when there are no packets to send. Its list of channels with data (yojimbo.cpp:2758) is completely empty, indicating that it knows there is no data to send, yet it still generates and writes a packet on every single tick. GeneratePacket then returns true, and reliable_endpoint_send_packet from reliable.io is always called.
During this call the following code is always executed:

```cpp
uint16_t sequence = endpoint->sequence++;
```

This increments endpoint->sequence globally.

I haven't been able to figure out what's actually happening, or what's supposed to be happening. When SendPackets is called without any messages queued on any channel, packetData = GetPacketBuffer() still returns the data \0 and packetBytes ends up as 5, after which reliable_endpoint_send_packet is called, though I'm not sure it actually sends anything. With reliable_log_level set to debug, it does look like stuff is sent every tick.

Now, what's the implication of this?
Well, packetSequence is incremented on every single tick. Say I send ten packets, the last of which has packetSequence 9, and then let SendPackets run for roughly 32765 ticks without dispatching any messages to send. Our packetSequence is then 32775, which has implicitly wrapped around to behave like 5. When an actual message is finally sent on that tick, the code in AddMessagePacketEntry (m_sentPackets->Insert) checks whether the previously sent packet's sequence number (9) is lower than that of the currently outbound packet (effectively 5). Since it isn't, we hit an assert in AddMessagePacketEntry, because the very real packet we're about to send has resolved to NULL due to this wrapping check.

So, are we not supposed to call SendPackets on every tick, like all the examples in the repo and documentation do? Or have I done something completely wrong?
This is only an issue with the reliable channel; as far as I can tell, the unreliable channel just discards the packetSequence. But the counter still gets incremented for outbound unreliable packets, so even if I put up a guard before SendPackets to check whether we have any messages to send, this would still be an issue if I send ~32765 unreliable packets before the next reliable one. So if we're not supposed to call SendPackets unless there are reliable messages to send, how are we supposed to prevent the sequence number from being incremented while sending unreliable packets, to ensure that a future reliable packet won't get caught in this greater-than wrapping check?

This is such a big issue that I assume it's my fault, seeing as this library is used by so many people without anyone hitting it. So what can I possibly have done wrong? I can observe the same incrementing behaviour in the premake5 solution (Windows, VS2019), so I don't understand what I've done.

You are supposed to call SendPackets regularly.

It's intended to be OK to call, even if you don't have messages to send.

The idea is that the underlying UDP packets act as a carrier wave, in which reliability is performed on a per-packet basis (not a per-message basis).

The primary use case for this library is regularly sending unreliable state data in each packet via the unreliable channel, tacking on a few reliable-ordered messages now and then, and using the packet-level reliability to work out when reliable messages need to be resent due to packet loss.

There is probably a bug in the reliable channel related to the wrapping check, because in my test cases I regularly send unreliable and reliable messages. Long periods without reliable messages, including wrapping of the packet (reliability) sequence number, followed by sending reliable-ordered messages, seem to be the trigger for the behavior you are seeing.

For example, I'll bet if you sent a dummy reliable-ordered message once per second, the issue would go away.

sfla commented

Ok, I've done some testing, and I still don't understand how this hasn't been raised earlier.

My understanding is that packetSequence is an ever-increasing uint16, which has a max value of 65535 before it wraps around. When an actual reliable message is sent, the logic in Insert() (yojimbo.h:1407) checks whether this outbound packet's sequence number falls in the "half before" or the "half after" the previously sent reliable packet's packetSequence, relative to wrapping around 65535 (half being 32768, which is used in sequence_greater_than()).
This means that if one packet is sent with packetSequence 0, and then the next packet is sent at tick 32767, the second packet is correctly considered "after" the first packet.
However, if the second packet is sent at 32769, it is considered "before" the first packet due to wrapping.
The packet then resolves to NULL during this tick, but it is not discarded; it is still queued. However, this check fails again on every single subsequent tick, so the same thing happens for the following 32767 ticks, and the message is not sent until packetSequence wraps around and reaches 1, because that is the first upcoming packetSequence that is after 0. The packet is actually sent 32767 ticks too late because of this.

This is obviously the worst-case scenario. If I wait even longer, for example uint16.max + 1 ticks after the first message, it has then wrapped around making the next packetSequence 1 again, which gets sent immediately.

> The primary use case for this library is regularly sending unreliable state data in each packet via unreliable channel, and then tacking on a few reliable-ordered messages at the end.

This issue happens regardless of unreliable packets. Since the unreliable channel doesn't consult the packetSequence, it is irrelevant whether any unreliable packets are sent at all. I can send an unreliable packet on every single tick between reliable tick 0 and reliable tick 32769, and the latter will still have to wait for packetSequence to wrap around to 1.

> I'll bet if you sent a dummy reliable-ordered message once per second, the issue would go away

Yes. With a preset tick rate of e.g. 60 Hz, I could even send a dummy reliable message every 9 minutes to avoid this issue. However, this is a terrible "solution".
We will probably end up with a game that regularly sends reliable packets for important state syncs more often than every 32768 ticks. But it has been a pain in the ass in this development phase, where we've only implemented initial sync and chat as reliable, and we hit asserts when sending a chat message after a while. And if we disable the asserts, the chat message could take a really, really long time to show up, because it has to wait up to 32768 ticks to be sent, which resolves to up to 9 minutes of delay at a 60 Hz tick rate. Of course, this would only happen if no other reliable message was sent in the previous 9 minutes. But that has already happened to us a few times, leading to this discovery by accident.

> The idea is that the underlying UDP packets act as a carrier wave, in which reliability is performed on a per-packet basis (not a per-message basis).

I didn't understand this, or why it means that packetSequence has to be incremented for packets that aren't even sent, unless you're implying that every SendPackets() actually sends a reliable packet regardless of messages, which I doubt.
I'm sure there's a reason, but I haven't completely wrapped (lol) my head around the flow of information for the reliable channel, so I don't see any reason why packetSequence should be incremented when no packets are sent.

Fyi, this is easily reproducible by ticking every 1 microsecond. It advances to packetSequence 32768 in no time at all.

> Fyi, this is easily reproducible by ticking every 1 microsecond. It advances to packetSequence 32768 in no time at all.

That's cool. Thanks for the repro.

You have a workaround so I'm closing this issue. Thanks!

Thanks for the fix, @jorgenpt! I hope this gets pulled into the official library, because it's a pretty significant issue for applications where the unreliable channel is used very often and the reliable channel only seldom (e.g. just for chat messages).