Azure/azure-functions-kafka-extension

GroupCoordinator: *.*.*.*:9092: 1 request(s) timed out: disconnect with Azure Backed Service

TsuyoshiUshio opened this issue · 3 comments

The problem here is that Azure network load-balancing components silently drop idle network connections after 4 minutes.

I upgraded to Confluent 1.5.2 to solve this issue; however, it still remains. It looks solved as of 1.6.0-PRE3 and later.

confluentinc/librdkafka#3109
#193

I can reproduce the issue against Event Hubs with KafkaTrigger after a 5-minute delay.
I also confirmed that the new version resolves it.
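
For illustration, here is a minimal Python sketch of the mechanism, assuming the confluent-kafka client (which wraps librdkafka) rather than the Functions extension itself. It builds a consumer configuration that closes idle broker connections on the client side before the 4-minute cutoff is reached. To my knowledge, the `connections.max.idle.ms` property is only available from librdkafka 1.6.0 onward; the helper name, the 30-second safety margin, and the placeholder broker address are assumptions made for this example.

```python
# Sketch only: the helper name, the margin and the broker address are illustrative assumptions.
AZURE_IDLE_TIMEOUT_MS = 4 * 60 * 1000  # idle cutoff described in this issue


def build_idle_safe_config(bootstrap_servers, group_id, margin_ms=30_000):
    """Return a librdkafka-style config that recycles idle connections early."""
    if not 0 <= margin_ms < AZURE_IDLE_TIMEOUT_MS:
        raise ValueError("margin_ms must be in [0, AZURE_IDLE_TIMEOUT_MS)")
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # Close idle connections on the client side before the load
        # balancer can drop them silently.
        "connections.max.idle.ms": AZURE_IDLE_TIMEOUT_MS - margin_ms,
    }


# Possible usage, assuming the confluent-kafka package is installed:
#   from confluent_kafka import Consumer
#   consumer = Consumer(build_idle_safe_config("broker.example:9092", "my-group"))
```

The margin keeps the client-side limit comfortably below the 4-minute cutoff, so a connection is closed and re-established deliberately instead of timing out on the next request.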

Mitigation

We provide a pre-release that fixes this issue. This is not an official release; however, you can use it to test whether it helps resolve your issue.

https://www.nuget.org/packages/Microsoft.Azure.WebJobs.Extensions.Kafka/3.3.1-PRE1

amotl commented

Dear Tsuyoshi,

can you confirm this is really coming from the infamous idle network connection drops by Azure LBs? Have you been able to reproduce it with librdkafka 1.6.0-PRE3 or even 1.6.0-PRE4?

From reading the librdkafka issue tracker, you might want to run the client with debug=all in order to get more detailed insights.
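
For reference, a minimal sketch of how debug=all could be switched on, assuming a Python client built on confluent-kafka (which passes its configuration dictionary through to librdkafka). The helper name `with_debug` and the placeholder broker address are hypothetical; how the same setting is exposed by the Functions extension is not shown here.

```python
# Sketch only: `with_debug` and the broker address are illustrative assumptions.
def with_debug(config, contexts="all"):
    """Return a copy of a librdkafka-style config with debug logging enabled."""
    debug_config = dict(config)  # leave the caller's config untouched
    # librdkafka takes a comma-separated list of debug contexts;
    # "all" enables every context and is very verbose.
    debug_config["debug"] = contexts
    return debug_config


base_config = {"bootstrap.servers": "broker.example:9092", "group.id": "my-group"}
debug_config = with_debug(base_config)

# Possible usage, assuming the confluent-kafka package is installed:
#   from confluent_kafka import Consumer
#   consumer = Consumer(debug_config)
```

If the full output is too noisy, a narrower selection such as `with_debug(base_config, "broker,cgrp")` may be enough to observe connection and group coordinator activity.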

While I can't say for sure this is related, I am also referencing confluentinc/librdkafka#2739 and confluentinc/librdkafka#2944 here. Please investigate both issues thoroughly and check if you can make any correlations with your observations.

With kind regards,
Andreas.

TsuyoshiUshio commented

Thank you for your comment, @amotl. I mean that I reproduced it with 1.5.2; 1.6.0-PRE4 looks good. How can we confirm whether the issues you mentioned are happening?

amotl commented

Dear Tsuyoshi,

ah, I see.

> Some of [our] users [tripped into] this issue. However, I can't have a confidence.
> How can we confirm the issue happens that you mentioned [in order to gain more confidence]?

I apologize that I can't contribute much to your question with respect to pinpointing a specific aspect. However, I have tried to share more details about our environment and our observations at #193 (comment).

As outlined there, we have been trying to mitigate this issue by trial and error and have simply shared our observations.

With kind regards,
Andreas.