Azure/azure-functions-kafka-extension

Review default values for configuration settings

amotl opened this issue · 8 comments

amotl commented

Dear @TsuyoshiUshio and @fbeltrao,

firstly, I would like to apologize for bumping into the review process at #186 (comment) and #186 (comment). Let's better continue a more thorough discussion here.

As a followup to #185, I would like to dedicate this topic a separate issue regarding

I would like to emphasize that it is probably also crucial to apply the recommended configuration settings for Kafka on Azure as outlined within confluentinc/librdkafka#3109 (comment).

As we found out recently, Microsoft published some recommended configuration properties for librdkafka at [1]. It would be nice if the Azure function bindings for Kafka would already use these recommendations out of the box. I didn't check the code base thoroughly yet, so I am humbly asking for your support on that for the upcoming release when the bump to librdkafka 1.5.2 will take place.

As outlined within [1], the configuration properties socket.keepalive.enable and metadata.max.age.ms are crucial to get right within Azure environments in order to save users from running into issues like outlined within confluentinc/librdkafka#3109. I strongly recommend to apply these default values out of the box as most regular users will not be aware of them and might experience nasty things.

Property Recommended Values Permitted Range Notes
socket.keepalive.enable true   Necessary if connection is expected to idle. Azure will close inbound TCP idle > 240,000 ms.
metadata.max.age.ms ~ 180000 < 240000 Can be lowered to pick up metadata changes sooner.

While these settings [1] are primarily dedicated to Event Hubs, they will also apply to all communications with vanilla Kafka server components, as the underlying networking infrastructure problem outlined within confluentinc/librdkafka#3109 will be the same.

Thanks for listening and with kind regards,
Andreas.

[1] https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-configurations#librdkafka-configuration-properties

amotl commented

Dear @TsuyoshiUshio,

based on my observations at #186, I believe making topic.metadata.refresh.fast.interval.ms tuneable is not so important and you might want to get rid of this again.

However, getting the default values for the settings socket.keepalive.enable and metadata.max.age.ms right from the beginning will bring more value with respect to a reasonable out-of-the-box experience to all future users of this extension.

So, I am humbly asking you to focus on the SocketKeepaliveEnable and MetadataMaxAgeMs settings instead and apply the respective default values if nobody objects.

SocketKeepaliveEnable = true
MetadataMaxAgeMs = 180000

With kind regards,
Andreas.

Hi @amotl
Thank you for your jump in. That upgrade is trying to include your request. I'd like to achieve your goal.
Could you help me to understand and clarify it?

The reason why I include topic.metadata.refresh.interval.ms is I found this
image
on your post confluentinc/librdkafka#3109

I just want to achieve your goal. So I'd like to know the exact parameter should I add to achieve your goal.

According to the https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md documentation, metadata.max.age.ms default value is this. That means, to solve the issue, we need make it configurable metadata.max.age.ms not topic.metadata.refresh.interval.ms. Am I right?

topic.metadata.refresh.interval.ms

Metadata cache max age. Defaults to topic.metadata.refresh.interval.ms * 3
Type: integer

Currently, we have these configuration for librdkafka

image

In short, For achieve your goal. We need to make it configurable these parameters and you think it might be better to set it as default, right?

SocketKeepaliveEnable = true
MetadataMaxAgeMs = 180000

I agree with you we should make it default since the issue is hard to find and I suppose a lot of customer use this on Azure. The reason is, for this library, customer can choose the version by themselves. It will be breaking change however, a lot of use case is for azure. So I think I add the parameter on the releasenote and avoid the breaking change you can configure these parameters. I guess, that parameter doesn't affect a much. MetadataMaxAge is 900000 by default. It will introduce more frequent refresh on the cache. Make it KeepAlive is preferable for Azure Functions.

What do you think? @fbeltrao @ahmelsayed ?

NOTE: Why we need it.

To shorten TL;DR: Azure has a nasty artificial limitation that results in being unable to use long-lived TCP connections that have >= 4 minutes of radio silence at any given point.
They screwed it up so hard that when connection does timeout, they acknowledge the following TCP packets with an ok flag that makes the sender think “everything is okay - the data I sent was received succesfully”, which is 100 % unacceptable way to handle error conditions.

This caused me so much pain and loss of productive work time.

-- https://joonas.fi/2017/01/23/microsoft-azures-networking-is-fundamentally-broken/

confluentinc/librdkafka#3109

amotl commented

Dear @TsuyoshiUshio,

thanks for considering to bring in the respective default values for the configuration properties which are of primary concern here.

While I recognized topic.metadata.refresh.interval.ms has been mentioned within a quote coming from Magnus Edenhill I've cited within my report, I don't know much about this parameter yet. And I don't know much about topic.metadata.refresh.fast.interval.ms either.

However, as noted above, I believe the configuration properties socket.keepalive.enable and metadata.max.age.ms are crucial to get right within Azure environments. As noted in my primary post, these are the official recommended settings on Azure Event Hubs for both of them.

Property Recommended Values Permitted Range Notes
socket.keepalive.enable true Necessary if connection is expected to idle. Azure will close inbound TCP idle > 240,000 ms.
metadata.max.age.ms ~ 180000 < 240000 Can be lowered to pick up metadata changes sooner.

We need to make it configurable these parameters and you think it might be better to set it as default, right?

SocketKeepaliveEnable = true
MetadataMaxAgeMs = 180000

Exactly. Both setting these and upgrading to librdkafka 1.5.2 improved things tremendously for us.

With kind regards,
Andreas.

[1] https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-configurations#librdkafka-configuration-properties

Thank you for helping me to understand! I'll implement it. For the breaking change part, I'll wait the other guys opinion. However, I'll add the properties at least.

I also add the properties that is described as recommended for EventHubs.
https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-configurations#librdkafka-configuration-properties

I investigate other properties however, all of them is already configured as the ideal configuration. I just add these two.

SocketKeepaliveEnable = true
MetadataMaxAgeMs = 180000

Hi @amotl
Sounds good? #186

amotl commented

Dear @TsuyoshiUshio,

that looks good so far. Thank you very much!

With kind regards,
Andreas.

amotl commented

Thanks a again for considering these improvements! Closing this now.