rabbitmq/rabbitmq-dotnet-client

Async API - socket read timeout has no effect

lukebakken opened this issue · 7 comments

Is your feature request related to a problem? Please describe.

The v7 async API uses .NET Pipelines and ReadAsync from a NetworkStream in SocketFrameHandler:

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/main/projects/RabbitMQ.Client/client/impl/SocketFrameHandler.cs#L284

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/main/projects/RabbitMQ.Client/client/impl/Frame.cs#L257-L283

Even though we set a socket read timeout here, this will have no effect on an async read (docs)

So, this means that reads will never time out. This should be OK because we have heartbeats, but it would be great to have a test for this.

Describe the solution you'd like

A test for socket read timeout. It'll probably involve toxiproxy.

cc @danielmarbach @bording @bollhals

I'm pinging y'all because of what I've found with regard to an async read timeout. As I suspected, it can only be achieved using a CancellationTokenSource per read attempt:

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/rabbitmq-dotnet-client-1492/projects/RabbitMQ.Client/client/impl/SocketFrameHandler.cs#L289-L294

This will probably affect performance negatively.

What I have found, however, is that the heartbeat mechanism does eventually figure out when no data is coming over the socket, and closes the connection. However, with RequestedHeartbeat set to 5 seconds, it takes about 45 seconds for the connection to actually close:

> .\projects\gH1492\bin\Debug\net6.0\GH1492.exe
[INFO] 2024-02-09T17:52:47 START TOXIC NOW
[INFO] 2024-02-09T17:52:58 publishing message: 0
[INFO] 2024-02-09T17:52:59 publishing message: 1
[INFO] 2024-02-09T17:53:00 publishing message: 2
[INFO] 2024-02-09T17:53:01 publishing message: 3
[INFO] 2024-02-09T17:53:02 publishing message: 4
[INFO] 2024-02-09T17:53:03 publishing message: 5
[INFO] 2024-02-09T17:53:04 publishing message: 6
[ERROR] 2024-02-09T17:53:48 Caught execption: RabbitMQ.Client.Exceptions.AlreadyClosedException: Already closed: The AMQP operation was interrupted: AMQP close-reason, initiated by Library, code=0, text='End of stream', classId=0, methodId=0, exception=System.IO.EndOfStreamException: Heartbeat missing with heartbeat == 00:00:05 seconds
   at RabbitMQ.Client.Impl.ChannelBase.WaitForConfirmsOrDieAsync(CancellationToken token) in C:\Users\lbakken\development\rabbitmq\rabbitmq-dotnet-client_main\projects\RabbitMQ.Client\client\impl\ChannelBase.cs:line 1922
   at GH1492.Program.AsyncMain() in C:\Users\lbakken\development\rabbitmq\rabbitmq-dotnet-client_main\projects\GH1492\Program.cs:line 40
[INFO] 2024-02-09T17:53:48 Closing channel and connection...
[INFO] 2024-02-09T17:53:48 Done closing channel and connection.

I feel like I should focus on getting the heartbeat mechanism a bit more reactive in this situation.

Thoughts?

This will probably affect performance negatively.

Can you expand a bit on what you mean by this?

Can you expand a bit on what you mean by this?

Well, there have been so many micro-optimizations in this library with regard to reducing object initialization and allocation I'm assuming the same would apply to this scenario. I can run some benchmarks, of course.

I think taking a close look at heartbeats needs to happen anyway.

Can you observe the timer queue counter behavior and take a dump? The timed CTS uses timers as far as I recall and it sounds like we are stalling there somehow.

You could also try to pool the CTS and try reset them to see if it makes the problem occur less which could give us more intel too

@danielmarbach there's no problem - when I use a cancellation token source with a timeout, the async read is cancelled as expected:

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/rabbitmq-dotnet-client-1492/projects/RabbitMQ.Client/client/impl/SocketFrameHandler.cs#L289-L294

My concern is about the performance cost of setting up a CTS for every call to that method.

@lukebakken sorry I misunderstood the whole conversation.

I think I would get started with the version you have in the linked code. You can then throw it under a benchmark if needed to compare to a more optimized version. I was thinking a bit about an optimized version but it could become quite cumbersome and potentially error prone. For example you could optimize stuff around the fact that there is only one mainloop and the read timeout rarely ever changes. That would make it possible to have a cancellation token source managed by the frame handler that is always reset after every ReadFrameAsync call. You would then still have to have a linked token source that is linked to the token passed into ReadFrameAsync. Of course technically you could do a similar type of trickery to further assume the token passed there is only ever get cancelled when the main loop gets cancelled and cache that linked one too. Given though that this significantly increases the complexity of this code due to the branching and the coupling to the assumptions of how it is being called this type of "optimization" could bite back sooner rather than latter so I'm very skeptical if it is worth it and as said you would first have to measure the proposed changes against the version with resets and additional conditionals to see if it is indeed faster.

Maybe there are also other solutions but this is all I could come up with in the limited time I had available.

@danielmarbach I appreciate your input!

That would make it possible to have a cancellation token source managed by the frame handler that is always reset after every ReadFrameAsync call.

Yep, I considered that, but reset is not supported in netstandard2.0.

I am going to add tests to ensure the heartbeat and continuation timeout feature adequately covers the scenario of a socket read timeout. I'm pretty sure it will. In that case, socket read and write timeouts should no longer be necessary.