Memory issue establishing connection to IoT Hub via DeviceClient
lsuryana-ibms opened this issue · 11 comments
I've created a simple test app that tries to open a connection to an IoT hub that doesn't exist. After 3 minutes, it has gone through all of my 32 GB of memory.
The same issue occurs when it has established a connection, but then I disable the device in IoT Hub and it attempts to reconnect; again it will chew up all of my memory.
I think the issue is related to the transport type: Mqtt_WebSocket_Only DOES NOT exhibit the memory issue, only Mqtt_Tcp_Only and Mqtt (which, from the description, uses TCP first).
ValidateIoTHubConnection_Error.zip
Context
OS, version, SKU and CPU architecture used: Windows 11 Business, 10.0.22621, x64
Application's .NET Target Framework : .NET 6
Device: Laptop
SDK version used: Microsoft.Azure.Devices.Client v1.41.3
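For reference, here's a minimal sketch of the kind of repro the attached app performs (the device ID and key below are placeholders, not the actual values from the app):

```csharp
using Microsoft.Azure.Devices.Client;

// Sketch of the repro: open a connection to a hub hostname that doesn't exist.
// Device ID and key are placeholders.
var connectionString =
    "HostName=asdfasdf.azure-devices.net;DeviceId=testDevice;SharedAccessKey=<placeholder>";

// Mqtt_Tcp_Only (and Mqtt) reproduce the memory growth; Mqtt_WebSocket_Only does not.
using var deviceClient = DeviceClient.CreateFromConnectionString(
    connectionString, TransportType.Mqtt_Tcp_Only);

// Memory starts climbing while this connection attempt retries.
await deviceClient.OpenAsync();
```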
Hello @lsuryana-ibms, thanks for attaching the sample app that reproduces this issue.
Could you also add the following information to the issue description so that we can triage it better?
Context
- OS, version, SKU and CPU architecture used: (Windows 10 Desktop x64, Ubuntu 15.04 x86, Windows 10 IoT Core arm32, etc.)
- Application's .NET Target Framework : (See https://docs.microsoft.com/en-us/dotnet/standard/frameworks. E.g. netcoreapp2.1, net451, uap10.0, xamarin)
- Device: (Laptop, Raspberry PI3, Android APIv25 etc.)
- SDK version used: (Please include the NuGet package version for all involved components)
I've updated the issue description.
Thanks
Hi @lsuryana-ibms, thanks for sharing the other details! Unfortunately I couldn't reproduce this issue; the application stopped with Retry_Expired after a while.
Could you please share your memory dump file with us so that we can get a better understanding of the memory issue?
FWIW, feel free to take a look at our device reconnection sample, which shows how to handle the different combinations of ConnectionStatus and ConnectionStatusChangeReason, rather than just waiting in the code.
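For illustration, a trimmed-down sketch along the lines of that sample (the real sample covers more status/reason combinations; `deviceClient` here is the client instance from your app):

```csharp
// Register a handler so the app can react to status changes instead of just waiting.
deviceClient.SetConnectionStatusChangesHandler((status, reason) =>
{
    Console.WriteLine($"Connection status changed: {status}, reason: {reason}");

    switch (status)
    {
        case ConnectionStatus.Connected:
            // Connection (re)established; resume normal operations.
            break;
        case ConnectionStatus.Disconnected_Retrying:
            // The SDK is retrying on its own; usually nothing to do here.
            break;
        case ConnectionStatus.Disconnected:
        case ConnectionStatus.Disabled:
            // Terminal states (e.g. Retry_Expired, Device_Disabled, Bad_Credential):
            // dispose the client and decide whether to re-initialize or give up.
            break;
    }
});
```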
Interesting. I can't attach the memory dump with heap, because within 4 seconds it has already grown to 2 GB.
I've taken a couple of screenshots; hope they are useful.
Thanks for your patience!
I finally noticed that this issue could only be reproduced with the particular hostname you shared in your code, "asdfasdf.azure-devices.net". Previously I did my testing with some other random hostnames, like "non-existing.azure-devices.net" or "aaa.azure-devices.net", and they all performed normally.
With a ping on this particular hostname, I got the following results. You can see it was pointing to a specific gateway.
A ping on another "truly" non-existing hub, in contrast, gives me this:
I think this issue only happens when hitting a hub like this. How did you come up with this hostname?
We started seeing the issue when the network goes down and the application is trying to connect to IoT Hub. On my machine, it doesn't matter what the hostname is; the memory just shoots up when it attempts to reconnect.
The 2 scenarios where I can replicate the issue are:
- Connecting to a non-existent host.
- Disabling the IoT device through IoT Hub.
You're right, using aaa.azure-devices.net seems to perform normally, which is the same behaviour as using asdfasdf.azure-devices.net with the WebSocket protocol.
asdfasdf.azure-devices.net was just a random string I used to try to replicate the issue of connecting to a non-existent host.
I tested by following the repro steps you shared:
- established a connection, then disabled the device in the portal
- or: turned off networking on my laptop, then ran your code
- or: established a connection, then turned off networking on my laptop
All three cases performed normally and the process memory was stable. Currently the only case in which I can reproduce this issue is trying to hit the hub with hostname "asdfasdf.azure-devices.net".
Since pinging "asdfasdf.azure-devices.net" gets routed to a specific gateway named "gateway-prod-gw-eastus-5-tls10.eastus.cloudapp.azure.com [40.78.229.128]", I believe this hub actually exists by coincidence. We don't know its configuration or status, though.
Interesting. If it exists, shouldn't it then validate the device ID and key? I've tried passing in a random device ID for my own hostname and it returns a communication error, with memory staying stable.
I checked the "asdfasdf" hub internally with the service team and we can see it is currently active. Another important finding from further testing: this memory issue appears to be consistently reproducible with Gateway V2 hubs, and "asdfasdf" is one of them. Meanwhile, Gateway V1 hubs and non-existing hubs don't have this issue.
We are still investigating this and will keep you posted here. Another thing I noticed in your sample code: to explicitly open the connection with your device client, please use await deviceClient.OpenAsync() instead of deviceClient.OpenAsync().Wait(); otherwise, it can cause deadlocks.
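For example (sketch only):

```csharp
// Blocking on the async call can deadlock and wraps failures in an AggregateException:
// deviceClient.OpenAsync().Wait();

// Prefer awaiting it from an async method instead:
await deviceClient.OpenAsync();
```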
I tested this further in several scenarios with our V1 .NET SDK, and this issue could only be reproduced when establishing a connection to an existing Gateway V2 hub, whether using bad credentials or disabling the device in the hub. After doing some research, I suspect it is caused by DotNetty natives in the MQTT stack we use in our V1 SDK.
Furthermore, I tested the same scenarios in our V2 .NET SDK, where we use a different MQTT stack, and I no longer see this issue. The V2 SDK is still in preview, but feel free to test with it and share your feedback. The migration guide is here for your reference.
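As a rough illustration of what the V2 preview API looks like (type names are taken from the preview and its migration guide at the time of writing and may change before release):

```csharp
// V2 preview sketch: the transport is configured through client options
// rather than a TransportType enum; names follow the preview and may change.
var options = new IotHubClientOptions(new IotHubClientMqttSettings());

var deviceClient = new IotHubDeviceClient(connectionString, options);
await deviceClient.OpenAsync();
```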
For now, we may not address this in the V1 SDK, but we appreciate your report regarding the issue!
Perfect. Thanks for your efforts, Bryce.
We've been using MQTT over WebSockets for now and will migrate to V2 when it's ready.
Thanks again!