eclipse-mosquitto/mosquitto

QoS 2 handshake messages (PUBREC, PUBREL, PUBCOMP) delayed up to 1 minute


Version: 2.0.18
OS: Debian Bookworm

I have a v5 MQTT bridge set up between two hosts. The host that originates the bridge runs Mosquitto, and the other host runs an EMQX broker.

The host running Mosquitto is connected to the internet over a cellular network.

Mosquitto begins forwarding 4 relatively large messages (~1MB each) over the bridge.

In the middle of these messages being published, EMQX forwards a QoS2 message to Mosquitto.

Mosquitto receives this and logs that it sent a PUBREC for the message.

A Wireshark capture shows that the PUBREC message does not get sent at the time Mosquitto logged it.

30s later, EMQX attempts the PUBLISH again.

The same behavior ensues.

Mosquitto finishes sending one of the large messages it was in the middle of publishing, and then Wireshark shows that the previous PUBRECs, which the log claimed were sent, finally get sent. They appear to be sent in tandem with the large message.

EMQX then responds with two PUBREL messages.

Mosquitto responds with a PUBREC.

In some instances of this bug, Mosquitto sends a disconnect request and the bridge is re-established.

Some relevant configuration:

max_inflight_messages is set to 10, and the EMQX client statistics show that the number of inflight messages never surpasses 5. The max_inflight_bytes is set to default (no cap).

There is only a single TCP stream in an MQTT connection, so it is not possible to send the PUBREC until the outgoing PUBLISH has completed. The log states that the broker is "sending" the PUBREC; it does not say that it has been sent. The packet is queued and will be delivered after the outgoing PUBLISH. I can see that the log message is confusing in this scenario; however, what you describe seems to be exactly what I would expect from the Mosquitto side.
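To make the queuing effect concrete, here is a rough model of a single outgoing stream. It is a sketch only and not Mosquitto's actual code: the link rate and the `drain_times` helper are assumptions made up for illustration, and the issue does not state the real cellular throughput. The PUBREC size assumes the minimal 4-byte form (2-byte fixed header plus 2-byte packet identifier).

```python
from collections import deque

# Assumed values for illustration only; the issue does not state the link rate.
LINK_BYTES_PER_SEC = 50_000   # hypothetical cellular uplink (~400 kbit/s)
PUBLISH_SIZE = 1_000_000      # ~1 MB PUBLISH, as described in the report
PUBREC_SIZE = 4               # minimal PUBREC: fixed header + packet identifier


def drain_times(queue, bytes_per_sec, already_sent=0):
    """Return {name: seconds until that packet's last byte is written}.

    Packets leave strictly in FIFO order because there is one TCP stream.
    `already_sent` is how many bytes of the head packet are already written.
    """
    elapsed = 0.0
    finished = {}
    for index, (name, size) in enumerate(queue):
        remaining = size - already_sent if index == 0 else size
        elapsed += remaining / bytes_per_sec
        finished[name] = elapsed
    return finished


# The broker is halfway through a large PUBLISH when it queues the PUBREC.
outgoing = deque([("PUBLISH#1", PUBLISH_SIZE), ("PUBREC", PUBREC_SIZE)])
times = drain_times(outgoing, LINK_BYTES_PER_SEC, already_sent=PUBLISH_SIZE // 2)

for name, seconds in times.items():
    print(f"{name}: on the wire after {seconds:.2f}s")
```

Under these assumed numbers the 4-byte PUBREC cannot leave until the remaining half of the PUBLISH has drained, so its delay is set almost entirely by the large message ahead of it. With several ~1 MB messages queued on a slower link, the wait could plausibly reach the tens of seconds reported above.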

On the EMQX side, I'm afraid it doesn't obey the specification.

When a Client reconnects with Clean Start set to 0 and a session is present, both the Client and Server MUST resend any unacknowledged PUBLISH packets (where QoS > 0) and PUBREL packets using their original Packet Identifiers. This is the only circumstance where a Client or Server is REQUIRED to resend messages. Clients and Servers MUST NOT resend messages at any other time [MQTT-4.4.0-1].

So retrying messages on a stable connection is forbidden. Even without that requirement, it's logically a bad thing to do. With TCP the packets arrive in order, or not at all. If the connection has dropped, then the broker/client will retry messages as appropriate when the reconnection occurs. If the connection hasn't dropped, then we know the original message must have arrived and there is a valid reason for the reply to have been delayed. In this case it is the slow rate of the outgoing PUBLISH, in another case it may be that the client is otherwise overloaded. In neither case does it help if the broker retries the PUBLISH, and it has a good chance of making the situation worse. If there is an option to disable this kind of retry in EMQX I strongly recommend you do so. I'm amazed that they still do it.
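The resend rule can be sketched as sender-side session state that retransmits only when a session is resumed, never on a timer. This is a hypothetical illustration: `SenderSession` and its methods are invented names, not EMQX or Mosquitto APIs, and the model assumes every reconnect resumes an existing session (Clean Start 0, session present).

```python
class SenderSession:
    """Hypothetical sender-side session state; not EMQX or Mosquitto code."""

    def __init__(self):
        self.unacked = {}  # packet identifier -> payload awaiting acknowledgement
        self.wire = []     # record of (packet identifier, DUP flag) writes

    def publish(self, packet_id, payload):
        # First transmission: remember the message until it is acknowledged.
        self.unacked[packet_id] = payload
        self.wire.append((packet_id, False))

    def on_ack(self, packet_id):
        # PUBACK/PUBREC received: the message no longer needs resending.
        self.unacked.pop(packet_id, None)

    def on_timer(self):
        # Deliberately a no-op: on a live connection TCP has either delivered
        # the original packet or the connection will drop, so a timed resend
        # only adds load to an already slow link.
        pass

    def on_reconnect(self):
        # Assumes the session was resumed: resend unacknowledged messages
        # with their original packet identifiers and the DUP flag set.
        for packet_id in list(self.unacked):
            self.wire.append((packet_id, True))


session = SenderSession()
session.publish(1, b"reading")
session.on_timer()       # 30 s pass with no PUBREC: nothing is resent
session.on_reconnect()   # connection dropped and resumed: resend once
print(session.wire)
```

In this model the 30-second retry seen in the capture corresponds to doing work in `on_timer`, which is the step the quoted requirement rules out.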