emqx/emqx-bridge-mqtt

"Can't be found from the inflight" log spam

louisburton opened this issue ยท 24 comments

I am unsure if this is a duplicate of : Bridge connection loops messages back forever

Due to this reported issue, yesterday I rolled out a change to about 20 devices disabling the file persisted replayq, and upgrading from 3.2.2 to 4.1.0.
I have not used a receive_mountpoint (it wasn't available in 3.2.2 which we're maintaining the contract for), and have seen this occur on two devices since rolling out yesterday. This issue was resolved by restarting - but I am fearful of it occurring again or if I roll this out further.

Bridge config:

bridge.mqtt.cloud.username=uniquedeviceid
bridge.mqtt.cloud.batch_size=32
bridge.mqtt.cloud.queue.max_total_size=1000MB
bridge.mqtt.cloud.clean_start=false
bridge.mqtt.cloud.subscription.1.topic=ctoe/uniquedeviceid/#
bridge.mqtt.cloud.forwards=reg/service/#,etoc/service/#,logs/#
bridge.mqtt.cloud.proto_ver=mqttv4
bridge.mqtt.cloud.tls_versions=tlsv1.2,tlsv1.1,tlsv1
bridge.mqtt.cloud.start_type=auto
bridge.mqtt.cloud.bridge_mode=true
bridge.mqtt.cloud.keepalive=60s
bridge.mqtt.cloud.clientid=uniquedeviceid
bridge.mqtt.cloud.subscription.1.qos=1
bridge.mqtt.cloud.max_inflight_size=32
bridge.mqtt.cloud.ssl=on
bridge.mqtt.cloud.address=edge.xxx.com:443
bridge.mqtt.cloud.queue.replayq_seg_bytes=10MB
bridge.mqtt.cloud.reconnect_interval=30s
bridge.mqtt.cloud.forward_mountpoint=etoc/uniquedeviceid
bridge.mqtt.cloud.retry_interval=20s

On two out of about 20 devices affected, at different times overnight it started churning out :

...
2020-06-25 09:00:41.996 [error] [Bridge] Can't be found from the inflight:54787
2020-06-25 09:00:41.997 [error] [Bridge] Can't be found from the inflight:54788
2020-06-25 09:00:41.998 [error] [Bridge] Can't be found from the inflight:54789
2020-06-25 09:00:41.999 [error] [Bridge] Can't be found from the inflight:54790
...

Our logs are sent over MQTT and these were sent successfully despite these errors from the bridge, but effectively started Denial of Servicing our EMQX broker on the cloud.

Why am I getting this error? I don't see a loopback configuration error?
Regardless of error - it should not constantly spam the logs at a rate of ~200 messages per second.

https://github.com/emqx/emqx-bridge-mqtt/blob/v4.1.0/src/emqx_bridge_worker.erl#L521-L523
Is it possible this refactor has created this regression @turtleDeng ? #39

emqx.log

@zhanghongtong Unfortunately, the nature of this issue has meant we've been unable to upgrade to 4.x.x, this is the most critical of the issues we've encountered yet, because when it fails it effectively attacks our cloud MQTT broker.

Not sure if this needs assignment, or if you need any further information. Let me know if so - thanks again!

Hi, @louisburton
We are on vacation now, so please forgive us for the delay

No problem, sorry I did not realise this - thanks for letting me know and for being so responsive, even on your vacation! ๐Ÿ˜„

Thank you for your understanding. We will finish our vacation on Monday. Have a nice day :)

Hi, @louisburton I have reproduced this issue. It happens after shutting down the remote Broker for a long time. Then, the emqx_bridge_mqtt will cache a mass of message to bridge to the remote Broker.

The problem occurs when emqx_bridge_mqtt reconnects to the remote server and delivers a large number of messages continuously.

It should be a BUG in emqx_bridge_mqtt for re-delivering messages. Messages can be correctly sent to the remote Broker, even if it is printed on so many logs. But it's a bad experience!

Thanks very much for reproducing, hope you enjoyed your break ๐Ÿ˜Ž .
You're right that messages can be correctly sent to the remote broker, even if it prints that many logs - but that degree of influx (potentially ~200 messages per second, per device) is too much for us and means we'll need to wait for this fix before we can upgrade to 4.x.x and take other fixes.

@zhanghongtong thanks for all the assistance with fixes! Do you think this ticket will still hit 4.2.0, and do you know roughly when this might be released? I'm going to look at rate limiting to mitigate against this a little, but would really like to avoid this issue altogether.

@louisburton We will fix this problem in the next version, in fact my code has been submitted, you can check this PR

I have seen this log spam still from v4.1.2 - I'm not sure this is fixed by the PR @zhanghongtong
Perhaps emqx/emqx#3629 acknowledges that?

CC @HJianBo
please verify

Yes, this PR is not to fix the problem. We plan to fix it at v4.2.0

@HJianBo @zhanghongtong - do you know if this is already fixed in the v4.2-alpha.3 release? If not, do we know when a v4.2.0 release is expected? Or is there a workaround, or something we could do to avoid this occurring?

We have actually suppressed/ignored the log spam, but we have realised that when this occurs, the same MQTT payloads get sent repeatedly. Thus even without the log spam, we spam repeated messages indefinitely. As a result, this is a critical issue for us that prevents us from rolling EMQX further to our device fleet.

We've managed to address most the issues I've raised thus far, and I'd really like to keep our integration with EMQX continuing, so any help is greatly appreciated ๐Ÿ™ thx

Was using QoS 1 and tried CLEAN_START true.
Can confirm they don't help avoid this issue.

@louisburton Did you find a solution?

We've been experiencing the same and I am quite wondered why there is no stable mqtt bridge with persistent storage solution yet (same with other brokers as mosquitto or vernemq).

@JeremyTheocharis sorry for the slow response.

In honesty, I did not find a solution. After a number of issues with the bridge, which might all be resolved in the latest emqx version (haven't tested), we decided to switch to Mosquitto's MQTT bridge (which I believe does now have persistence).

@louisburton thank you for the response!

Although mosquitto is advertising with persistent bridges we never got it to work properly and this resulted in a lot of missed messages. We decided now to program our own mqtt-bridge, which is still lacking a lot of features, but at least it does not DDoS the server or loses messages. It definitely works with emqx-edge. If anyone is interested, feel free to contact.

Hi @JeremyTheocharis I would be interested in finding out more about your solution. We have a SaaS product relying EMQX Edge which is due to into beta phase for 1st customers in a few weeks and this issue is a major blocker.

I've added you to our (currently still) private repo and wrote you an email.

Hi @JeremyTheocharis thanks for adding me to the repo. I was able to join and can access it. However I haven't recevied a sperate email from you, can you double check it sent ok?

@gercorri I tried a second email address from your git commits. If this does not work as well feel free to contact me at my email adress (just made it public).

@JeremyTheocharis got it now, thanks.

Sorry I did not notice this thread until today.

I made a fix here: emqx/emqx#4513

@JeremyTheocharis if you don't mind, we'd love to know your fix.
or maybe help to review mine see if it makes sense, thanks.

will port the fix to this repo later.