awslabs/aws-crt-nodejs

Websocket enters interrupt/resume loop after long inactivity in browser

FedericoBiro opened this issue · 6 comments

Hello,
We are using the Angular framework. IoT authentication is set up on the server following this guide: https://docs.aws.amazon.com/iot/latest/developerguide/authorizing-direct-aws.html
The token is set to last 12 hours.
The connection is built like this:
this.client = new mqtt.MqttClient();
const config = iot.AwsIotMqttConnectionConfigBuilder.new_builder_for_websocket()
    .with_clean_session(true)
    .with_client_id(client_id)
    .with_endpoint(AWS_IOT_ENDPOINT)
    .with_credentials(AWS_REGION, credentials.accessKeyId, credentials.secretAccessKey, credentials.sessionToken)
    .with_use_websockets()
    .with_reconnect_min_sec(0.5)
    .with_reconnect_max_sec(1)
    .with_keep_alive_seconds(2)
    .build();
connectionIot = this.client.new_connection(config);

We are seeing a bug when the browser is put in the background or, more generally, when the browser loses focus on the device (this can happen on Windows, iOS, and Android). After some time of inactivity, the websocket starts to interrupt and resume very frequently. After a random number of repetitions, the connection stays in this perpetual interrupt/resume loop, even after focus returns to the browser. The bug most commonly appears after a long period of inactivity (around 1 hour), but it can also occur after as little as 10 minutes. Disconnecting and reconnecting manually does not always work, because the new connection can remain broken.
I suspect the manual disconnect does not work properly because the websocket cannot send the disconnect event to the server, which may lead to further errors.
The manual disconnect and reconnect is done like this:
connectionIot.removeAllListeners();
await connectionIot.disconnect();
const config = iot.AwsIotMqttConnectionConfigBuilder.new_builder_for_websocket()
    .with_clean_session(true)
    .with_client_id(client_id)
    .with_endpoint(AWS_IOT_ENDPOINT)
    .with_credentials(AWS_REGION, credentials.accessKeyId, credentials.secretAccessKey, credentials.sessionToken)
    .with_use_websockets()
    .with_reconnect_min_sec(0.5)
    .with_reconnect_max_sec(1)
    .with_keep_alive_seconds(2)
    .build();
connectionIot = this.client.new_connection(config);

Is there a way to solve this?

I can't tell what exactly is wrong, but I can give some feedback.

The minimum value for keep alive supported by IoT Core is 30 seconds and anything below that is clamped to 30.

Some browsers (Safari seems to be the worst offender) will dramatically alter the JS execution environment for tabs in the background. For an internal ticket a while back I had a simple JS program schedule a recurrent task for every two seconds and print out the current time. I put the program in the background (nothing else going on in the browser or the host though) and the printed time values varied wildly, with gaps of 15-20 seconds between task invocations at times.
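For anyone who wants to repeat that kind of experiment, here is a minimal sketch (not the original test program). The names `gapsBetween` and `startDriftLogger` are made up for illustration; the only platform calls used are `setInterval`, `clearInterval`, and `Date.now`.

```typescript
const INTERVAL_MS = 2000; // same two-second period as the experiment described above

// Differences between consecutive timestamps, in milliseconds.
function gapsBetween(timestampsMs: number[]): number[] {
  const gaps: number[] = [];
  let previous: number | undefined;
  for (const t of timestampsMs) {
    if (previous !== undefined) {
      gaps.push(t - previous);
    }
    previous = t;
  }
  return gaps;
}

// Logs how late each tick fires. Returns a function that stops the logger.
function startDriftLogger(intervalMs: number = INTERVAL_MS): () => void {
  let last = Date.now();
  const handle = setInterval(() => {
    const now = Date.now();
    const gap = now - last;
    console.log(`${new Date(now).toISOString()} gap=${gap}ms late=${gap - intervalMs}ms`);
    last = now;
  }, intervalMs);
  return () => clearInterval(handle);
}
```

Calling `const stop = startDriftLogger();` in a tab, moving the tab to the background for a while, and then calling `stop()` should make any throttling visible in the logged gaps.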

When you combine this behavior with a very low keep alive time (30 seconds), you easily enter into situations where putting the program's tab into the background leads to a server-side disconnect because the keep alive PINGREQ packet (scheduled by mqtt-js to occur at the keep alive interval) gets delayed enough that the server shuts down the connection. People try to use keep alive as a way to detect disconnections faster, but it often just ends up making things more brittle.
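A rough back-of-the-envelope sketch of that arithmetic follows. It assumes the server drops the connection after 1.5 times the keep alive interval (the grace period the MQTT 3.1.1 spec describes) and that the client sends PINGREQ once per effective keep alive interval; both are assumptions about this setup, and the function names are hypothetical.

```typescript
const IOT_CORE_MIN_KEEP_ALIVE_SEC = 30; // minimum stated earlier in this thread
const KEEP_ALIVE_GRACE_FACTOR = 1.5;    // assumption: MQTT 3.1.1 grace period

// Keep alive the server actually applies after clamping low values.
function effectiveKeepAliveSec(requestedSec: number): number {
  return Math.max(requestedSec, IOT_CORE_MIN_KEEP_ALIVE_SEC);
}

// Seconds of silence after which the server may close the connection.
function serverTimeoutSec(requestedSec: number): number {
  return effectiveKeepAliveSec(requestedSec) * KEEP_ALIVE_GRACE_FACTOR;
}

// How late a PINGREQ can be before the server gives up, assuming the
// client pings once per effective keep alive interval.
function maxTolerablePingDelaySec(requestedSec: number): number {
  return serverTimeoutSec(requestedSec) - effectiveKeepAliveSec(requestedSec);
}

console.log(maxTolerablePingDelaySec(2));   // 15
console.log(maxTolerablePingDelaySec(600)); // 300
```

Under these assumptions a 30 second keep alive leaves about 15 seconds of slack, which is the same order as the 15-20 second timer gaps seen in background tabs, while a 10 minute keep alive leaves about 5 minutes.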

You also have an extremely low setting for max reconnect.

Knowing nothing else, my advice would be to bump keep alive up a lot (10-20 minutes) and also consider bumping max reconnect up a lot as well (60+ seconds).
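To illustrate why a 1 second cap on reconnect is so low, here is a sketch of a generic doubling backoff bounded by a min and max. This is an assumption for illustration only; the library's actual reconnect schedule may differ (for example, it may add jitter), and `backoffScheduleSec` is a made-up name.

```typescript
// Delay before each reconnect attempt, doubling from minSec and capped at maxSec.
function backoffScheduleSec(minSec: number, maxSec: number, attempts: number): number[] {
  const delays: number[] = [];
  let delay = minSec;
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(delay, maxSec));
    delay = Math.min(delay * 2, maxSec);
  }
  return delays;
}

console.log(backoffScheduleSec(0.5, 1, 5)); // [0.5, 1, 1, 1, 1]
console.log(backoffScheduleSec(1, 60, 8));  // [1, 2, 4, 8, 16, 32, 60, 60]
```

With min 0.5 and max 1, every attempt after the first is retried roughly once per second, so a connection that keeps failing is hammered continuously. A 60 second cap lets the retries spread out.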

Hello,
We tested the app with the suggested configuration, but the bug persists, although there has certainly been an improvement over the first configuration.
Since the problem still occurs, we also tried setting with_reconnect_min_sec to 30 seconds and with_reconnect_max_sec to 120, with keep alive set to 500 seconds, and another problem popped up. With these longer timings, when the application goes offline we do not receive the "interrupt" event until the keep alive time is reached. This means that if the network goes down, we have no way to tell whether the websocket is still up. Generally, when the device comes back online the websocket seems to come back online too, but the library emits no events. In some cases the connection loop problem still occurs, and we also lose control over the connection status.

I checked the code and saw that things like the connection state and the control of the reconnect timer are private. At this point, wouldn't it be better to make them public and leave control of the connection to the application?
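In the meantime, one possible app-side workaround is to mirror the connection state from the events the connection emits. The sketch below is only an illustration: `ConnectionStateTracker` and `ConnectionEvents` are hypothetical names, and the event names "connect", "interrupt", "resume", and "disconnect" are assumed to match the library version in use. It cannot detect a silent network drop any earlier than the library reports it.

```typescript
type ConnectionState = "disconnected" | "connected" | "interrupted";

// Minimal shape this sketch needs; assumed to be satisfied by the connection object.
interface ConnectionEvents {
  on(event: string, listener: (...args: unknown[]) => void): unknown;
}

class ConnectionStateTracker {
  private current: ConnectionState = "disconnected";
  private interruptCount = 0;
  private readonly onChange?: (state: ConnectionState) => void;

  constructor(connection: ConnectionEvents, onChange?: (state: ConnectionState) => void) {
    this.onChange = onChange;
    connection.on("connect", () => this.set("connected"));
    connection.on("resume", () => this.set("connected"));
    connection.on("interrupt", () => {
      this.interruptCount += 1; // a fast-growing count hints at the interrupt/resume loop
      this.set("interrupted");
    });
    connection.on("disconnect", () => this.set("disconnected"));
  }

  get state(): ConnectionState {
    return this.current;
  }

  get interruptions(): number {
    return this.interruptCount;
  }

  // Updates the state and notifies the callback only when the state changes.
  private set(next: ConnectionState): void {
    if (next === this.current) {
      return;
    }
    this.current = next;
    if (this.onChange) {
      this.onChange(next);
    }
  }
}
```

The tracker would have to be created right after new_connection(config) and before removeAllListeners() is ever called on that connection, since removing all listeners would also detach it.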