basicPublish can freeze for a very long time on network interface removal
sebek64 opened this issue · 2 comments
- RabbitMQ version: 3.9.21
- Erlang version: 12.3.2.2
- Client library version: 5.16.0
- Operating system, version, and patch level: Linux, kernel 5.10.0
- Java: openjdk version "17.0.5" 2022-10-18 LTS
The RabbitMQ Java client can freeze while writing to the socket when the network interface is removed. For example, run an app in Docker and disconnect its network with the docker network disconnect ... command. If the connection is in the middle of a basicPublish at that moment, the call is very likely to get stuck for a long time. None of the timeout settings we tried seem to help (SO_TIMEOUT, heartbeats, SO_KEEPALIVE, ...).
The thread is stuck with this stacktrace:
"DefaultDispatcher-worker-5" #315 daemon prio=5 os_prio=0 cpu=64.27ms elapsed=120.00s tid=0x00007fe9ecb2c650 nid=0x201 runnable [0x00007fe9d74f6000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.Net.poll(java.base@17.0.5/Native Method)
at sun.nio.ch.NioSocketImpl.park(java.base@17.0.5/NioSocketImpl.java:181)
at sun.nio.ch.NioSocketImpl.park(java.base@17.0.5/NioSocketImpl.java:190)
at sun.nio.ch.NioSocketImpl.implWrite(java.base@17.0.5/NioSocketImpl.java:415)
at sun.nio.ch.NioSocketImpl.write(java.base@17.0.5/NioSocketImpl.java:440)
at sun.nio.ch.NioSocketImpl$2.write(java.base@17.0.5/NioSocketImpl.java:826)
at java.net.Socket$SocketOutputStream.write(java.base@17.0.5/Socket.java:1045)
at java.io.BufferedOutputStream.flushBuffer(java.base@17.0.5/BufferedOutputStream.java:81)
at java.io.BufferedOutputStream.flush(java.base@17.0.5/BufferedOutputStream.java:142)
- locked <0x00000000c8b84988> (a java.io.BufferedOutputStream)
at java.io.DataOutputStream.flush(java.base@17.0.5/DataOutputStream.java:128)
at com.rabbitmq.client.impl.SocketFrameHandler.flush(SocketFrameHandler.java:197)
at com.rabbitmq.client.impl.AMQConnection.flush(AMQConnection.java:636)
at com.rabbitmq.client.impl.AMQCommand.transmit(AMQCommand.java:134)
at com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:455)
- locked <0x00000000c8b2b308> (a java.lang.Object)
at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:428)
- locked <0x00000000c8b2b308> (a java.lang.Object)
at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:710)
at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:685)
at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:675)
...
Locked ownable synchronizers:
- <0x00000000c8b820b8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
The netstat output shows that the socket's send buffer is still occupied.
From reading this library's source code and the NioSocketImpl sources, the socket appears to still be in a "recoverable" state: the flush call is blocked, and implWrite keeps waiting, optimistic that it will eventually be able to write more.
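To illustrate why SO_TIMEOUT does not help here, the following is a minimal sketch on a plain java.net.Socket, not on the RabbitMQ client. It assumes a loopback peer that accepts the connection but never reads, which fills the send buffer and is only a stand-in for the removed network interface. The class and method names (BlockedWriteDemo, writerStillBlocked) are made up for this example. Socket.setSoTimeout is documented to apply to reads, so the writer should stay parked in write() well past the configured timeout.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class BlockedWriteDemo {

    /**
     * Starts a writer against a peer that never reads and reports whether
     * the writer thread is still alive (blocked) after waitMillis,
     * even though SO_TIMEOUT is set to soTimeoutMillis.
     */
    static boolean writerStillBlocked(int soTimeoutMillis, long waitMillis) throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) { // accepted, but never read from

            client.setSoTimeout(soTimeoutMillis); // applies to reads, not to writes

            Thread writer = new Thread(() -> {
                byte[] chunk = new byte[64 * 1024];
                try {
                    OutputStream out = client.getOutputStream();
                    while (true) {
                        out.write(chunk); // blocks once the kernel buffers are full
                    }
                } catch (IOException e) {
                    // Reached only after the sockets are closed at the end of the demo.
                }
            });
            writer.setDaemon(true);
            writer.start();

            writer.join(waitMillis); // give the write far longer than SO_TIMEOUT
            return writer.isAlive();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("writer still blocked: " + writerStillBlocked(200, 2000));
    }
}
```

This only reproduces the "write blocks with a full send buffer" part of the report; it says nothing about how the kernel treats the socket once the interface is actually gone.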
Ideally, either the flush would throw an exception (which does not happen), or "heartbeat timeouts" could be detected in this library and the connection closed from outside.
If we try to implement this kind of behavior in the application itself, we fail. For example, if we time out the basicPublish call and then try to close or abort the connection, that call also tries to write to the socket, so it blocks as well.
For this reason, we believe this is a bug in the library itself, although a very subtle one that is hard to fix.
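For comparison, here is a hedged sketch of the "close from outside" idea on a plain java.net.Socket. It is not the client library's behavior and not a verified workaround: the class and method names (CloseUnblocksWriteDemo, closeUnblocksWriter) are hypothetical, and the blocked write is again simulated with a loopback peer that never reads. It relies on the documented behavior of Socket.close(), which makes a thread blocked in I/O on that socket throw a SocketException. Timing out the caller alone does not free the worker thread; closing the raw socket from another thread does.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CloseUnblocksWriteDemo {

    /** Returns true if closing the socket made the blocked write fail with an IOException. */
    static boolean closeUnblocksWriter(long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) { // accepted, but never read from

            // Stand-in for a publish call that ends up blocked in a socket write.
            Future<Boolean> write = pool.submit(() -> {
                byte[] chunk = new byte[64 * 1024];
                try {
                    OutputStream out = client.getOutputStream();
                    while (true) {
                        out.write(chunk);
                    }
                } catch (IOException e) {
                    return true; // the write was interrupted by the close
                }
            });

            try {
                write.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return false; // the write loop ended on its own, which is not expected
            } catch (TimeoutException stillBlocked) {
                // The caller gave up, but the worker thread is still parked in write().
            }

            client.close(); // close the raw socket from a different thread
            return write.get(5, TimeUnit.SECONDS);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("close unblocked writer: " + closeUnblocksWriter(1000));
    }
}
```

Whether the same approach can be applied to a connection managed by the client library, and whether it behaves the same way after a real interface removal, is not established in this thread.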
This is a pretty esoteric situation. It would help us investigate it sooner if you could provide a script or some other means of reproducing it easily. Ideally it would be as simple as docker compose up.
Thanks for the quick feedback. I'll try to prepare a simple simulation script.