GroupCoordinator: ...:9092: 1 request(s) timed out: disconnect with Azure Backed Service

Question

TsuyoshiUshio opened this issue 4 years ago · 3 comments

The problem here is that Azure network load-balancing components silently drop idle network connections after 4 minutes.
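A minimal sketch of how a client might be configured to stay under that 4-minute limit. It assumes a librdkafka-based client that accepts a dictionary of configuration properties. `socket.keepalive.enable` and `connections.max.idle.ms` are librdkafka property names (the latter is, to my understanding, only available from librdkafka 1.6.0 onwards). The helper name and the 60-second safety margin are hypothetical choices for illustration, not values taken from this issue.

```python
# Sketch only: the 4-minute figure comes from the issue text; the safety
# margin and the helper name are assumptions made for illustration.
AZURE_LB_IDLE_TIMEOUT_S = 4 * 60


def idle_safe_overrides(lb_idle_timeout_s=AZURE_LB_IDLE_TIMEOUT_S, margin_s=60):
    """Return librdkafka-style properties that keep broker connections from
    sitting idle longer than the load balancer tolerates."""
    if margin_s >= lb_idle_timeout_s:
        raise ValueError("margin must be smaller than the idle timeout")
    max_idle_ms = (lb_idle_timeout_s - margin_s) * 1000
    return {
        # Ask the OS to send TCP keepalive probes on broker sockets.
        # Whether probes fire soon enough depends on OS keepalive settings.
        "socket.keepalive.enable": True,
        # Let the client close idle connections itself before the load
        # balancer drops them silently.
        "connections.max.idle.ms": max_idle_ms,
    }


overrides = idle_safe_overrides()
print(overrides)
```

The resulting dictionary would be merged into the consumer or producer configuration. Closing idle connections on the client side means the next request opens a fresh connection instead of timing out on a dead one.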

I upgraded to Confluent 1.5.2 to solve this issue, but it still occurs. It appears to be fixed in 1.6.0-PRE3 and later.

I can reproduce the issue with EventHubs and KafkaTrigger when there is a 5-minute delay.
I also confirmed that the new version resolves it.

Mitigation

We provide a pre-release package that fixes this issue. It is not an official release, but you can use it to test whether it resolves your issue.

https://www.nuget.org/packages/Microsoft.Azure.WebJobs.Extensions.Kafka/3.3.1-PRE1

Answer 1 · 2020-11-20T07:48:32.000Z

Dear Tsuyoshi,

can you confirm this is really coming from the infamous idle network connection drops by Azure LBs? Have you been able to reproduce it with librdkafka 1.6.0-PRE3 or even 1.6.0-PRE4?

From reading the librdkafka issue tracker, you might want to run the client with debug=all to get more detailed insights.
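As a rough sketch of that suggestion, assuming a librdkafka-based client that takes its settings as a dictionary: `debug` is a librdkafka configuration property, and `all` enables every debug context. The helper name, broker address, and group id below are hypothetical placeholders, not values from this issue.

```python
def with_debug(config, contexts="all"):
    """Return a copy of `config` with librdkafka's `debug` property set.

    `contexts` is a comma-separated list of debug contexts, or "all".
    """
    updated = dict(config)  # copy so the caller's config is left untouched
    updated["debug"] = contexts
    return updated


# Hypothetical base configuration, for illustration only.
base = {
    "bootstrap.servers": "broker.example:9092",
    "group.id": "example-group",
}

debug_config = with_debug(base)
print(debug_config)
```

Since debug=all is very verbose, it is probably best enabled only while reproducing the timeout and removed again afterwards.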

While I can't say for sure this is related, I am also referencing confluentinc/librdkafka#2739 and confluentinc/librdkafka#2944 here. Please investigate both issues thoroughly and check if you can make any correlations with your observations.

With kind regards,
Andreas.

Answer 2 · 2020-11-20T08:02:30.000Z

Thank you for your comment, @amotl. I mean that I reproduced it with 1.5.2; 1.6.0-PRE4 looks good. How can we confirm that the issue you mentioned is occurring?

Answer 3 · 2020-11-20T08:34:50.000Z

Dear Tsuyoshi,

ah, I see.

> Some of [our] users [tripped into] this issue. However, I can't have a confidence.
> How can we confirm the issue happens that you mentioned [in order to gain more confidence]?

I apologize that I can't contribute much to your question with respect to pinpointing a specific aspect. However, I tried to share more details about our environment and the respective observations at #193 (comment).

As outlined there, we have been trying to mitigate this issue by trial and error and have simply shared our observations.

With kind regards,
Andreas.