GoogleCloudPlatform/google-cloud-iot-arduino

SERVER: The connection is closed because the server is shutting down.

CptEzz opened this issue · 14 comments

I have been experiencing a disconnection bug inside IoT Core on and off for a couple of months now. I originally listed the bug on the Google Issue Tracker but have had no movement on it: https://issuetracker.google.com/issues/186032867?pli=1

To explain the issue, the device will stay connected to IoT Core for hours if not days with no issue, but it randomly disconnects and will not get back online for a random amount of time. I have attached the server log showing the error:
server-error.txt

I have a feeling it has something to do with the JWT tokens, but this error doesn't seem to appear specifically when trying to refresh tokens, so I'm at a bit of a loss. As it's a very intermittent error, I unfortunately haven't been able to catch it on the device itself to see what it's saying.

What do your device logs say? Have you ruled out a device shutdown/restart?

Unfortunately, I haven't been able to get the logs on the device when it has this issue. I have the serial monitor sitting open waiting for the issue to appear, but I have only ever seen it server side. I haven't been able to rule out a shutdown, no; however, it has been sitting with a USB power supply on my desk this whole time, so I would be surprised if it was a power issue. I also haven't got anything else running on the device, so no failures from outside sources would be present.

An update on this issue:

I'm still experiencing it and have yet to be able to see the device logs when it happens. However, there has been a response on the issue tracker which pointed towards limiting MQTT traffic, the keep alive time and the idle time limit. I adjusted the keep alive to 600 and the timeout down to 1000, which are more in line with the examples; however, this hasn't helped.

I recently checked the cloud logs again and noticed an interesting pattern, that all my devices in the field (8 total, one of them can be completely off on some days as designed) are being dropped within a couple minutes of each other on some days and on others only a couple of them are. These devices all have independent mains power in a commercial site, so power isn't an issue. I have attached the log that shows this;
Capture

I also don't believe it is a traffic limit issue as during these events the devices would only be sending their PINGREQ messages every 10 or 15 minutes depending on the device.

I'm at a bit of a loss here, has anyone experienced a similar issue? Or even seen this error message in their logs?

Hello, did you manage to solve the issue?

Not yet, I'm still poking around in the library to see what I can do. I have added exponential backoff to all the disconnection handling methods and that has seemed to improve the stability. But I think that has just been a patch, not the solution to the underlying problem.
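For anyone unfamiliar with the exponential backoff mentioned above, here is a minimal Python sketch of the delay schedule only. The function name `backoff_delays` and all parameter values (1 second base, 64 second cap, up to 1 second of jitter) are assumptions for illustration and are not taken from this library.

```python
import random


def backoff_delays(base=1.0, cap=64.0, factor=2.0, jitter=1.0, rng=random.random):
    """Yield an endless sequence of reconnect delays in seconds.

    All names and defaults here are illustrative assumptions.
    """
    delay = base
    while True:
        # Add up to `jitter` seconds of random noise so that many devices
        # dropped at the same moment do not all reconnect in lockstep.
        yield min(delay, cap) + rng() * jitter
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * factor, cap)
```

A device would take the next value from the generator after each failed reconnect attempt and create a fresh generator once a connection succeeds.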

I don't think it is from the library. I have the same issue on a Pi Zero with paho-mqtt. I also tried rebooting the Pi in the on_disconnect method (which means I can't be spamming the cloud with reconnect attempts), and I checked whether another device was spamming the broker with telemetry at that time and didn't find anything. I am thinking about getting paid support from Google.

One more thing: I use gateways, so when this error occurs all devices detach from the gateway with the same status code.

Interesting you are finding the same issue with the Pi with a different library. I did end up sending a bug report to google but was told it was 'intended behavior' and the ticket closed. I'm also considering paid support as there is definitely something going on here that doesn't add up.

There is no documentation on this error code that I can find, and your issues are evidence that it's on the IoT Core side of things.

Which IoT Core region are you connecting to? I'm using asia-east1.

I am connecting to europe-west1.
I don't know if the solution is to not keep the connection alive, but to connect only when you want to send, and to find other ways to monitor the internet status of the embedded device.

Seems like you are all on the free use service. Going off the time stamps, I think you are exceeding the data rate you are allowed to send. You are being blocked due to excessive use and being connected for too long. AWS, ThingSpeak, etc. do the same. Free means for testing, but it is limited. I changed my connections to log in, send data and log out, no more than once per minute.

Hello @svdrummer, I don't think it is the free use service. The plan is pay as you go; if the data rate increases, I have to pay.

@bambachas there isn't much point in reconnecting only when you need to send data from the device, as it wouldn't be online when you need to send data to the device. That goes against the whole point of MQTT; devices should always remain connected.

@svdrummer I'm a paying GCP customer using their App Engine services, so this isn't a free tier issue. I do a PINGREQ every 15 minutes per device as that is the longest interval you can have. If this is more traffic than is allowed, then there is something wrong with all of their documentation and their service as a whole. Also, if it was excessive use, it would be throwing a different error as per the documentation.

Coming back in for an update on my investigation and communication with Google Support.

I ended up paying for case management to get an answer as to why this issue is appearing and received the following response;

There are regular rollouts for the MQTT bridges which also results in the devices to be disconnected (even multiple times in a row). It can also happen, scheduled maintenance on machines that host the MQTT bridges.

In this particular case the error message tells exactly what has happened: "SERVER: The connection is closed because the server is shutting down." This error happens when our MQTT bridges are stopping intentionally due to either scheduled rollout/maintenance. The possibility of these kinds of events is also mentioned in the documentation [4]. ("Note that very occasional disconnects are expected as servers are updated and load balanced.") (https://cloud.google.com/iot/docs/support/troubleshooting#my_device_is_disconnected_from_the_mqtt_bridge)

So, this is an intended behavior, the devices have to be prepared for unexpected disconnections, they should apply a retry logic and reconnect accordingly.

From my understanding of this response, it looks as though this is a library issue with not handling the disconnection properly, not an issue with IoT Core directly. Understandably, this library is a hobby project and as such might have some issues with these edge cases.
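To make the "retry logic and reconnect" advice concrete, here is a rough Python sketch of a reconnect loop. It is not this library's code: `reconnect_with_retry` and the `connect` callable are hypothetical names, `connect` is assumed to raise `OSError` on failure, and the attempt limit and delays are arbitrary. On IoT Core a real `connect` would presumably also need to create a fresh JWT when the previous one has expired.

```python
import time


def reconnect_with_retry(connect, max_attempts=8, base=1.0, cap=64.0, sleep=time.sleep):
    """Call `connect()` until it succeeds and return the number of attempts used.

    `connect` is a hypothetical callable that raises OSError on failure.
    If every attempt fails, the last OSError is re-raised.
    """
    delay = base
    for attempt in range(1, max_attempts + 1):
        try:
            connect()
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise
            # Wait before the next attempt, doubling the wait up to `cap`.
            sleep(delay)
            delay = min(delay * 2, cap)
```

Calling this from the disconnect handler means a bridge rollout that drops the device several times in a row results in a few spaced-out retries rather than a tight reconnect loop.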

This seems to be a significant issue. Thank you for following up.