kubernetes-sigs/cloud-provider-azure

Improve Service/LoadBalancer reconciliation performance

desek opened this issue · 7 comments

What would you like to be added:

I'd like for a Service of type=LoadBalancer (a Service with a Public IP) to reconcile faster.
The current implementation reconciles only one Service at a time, and --concurrent-service-syncs only accepts 1 as a value.
This forces the reconcile loop, which processes all Services, to handle them sequentially, one Service at a time.
In a cluster with 500+ Services, processing each Service takes 5-10 seconds, so a full reconciliation loop takes roughly 1 hour. As a result, a Service created just after the current loop has started takes at least double that time (~2 hours) to reconcile.
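As a back-of-the-envelope check of those numbers (a minimal sketch, not the actual controller code; the 5 s and 10 s figures are the observed per-Service latencies above):

```go
// With --concurrent-service-syncs effectively fixed at 1, the service
// reconcile loop behaves like a single worker, so a full pass scales
// linearly: total time ~= numServices * perServiceLatency.
package main

import (
	"fmt"
	"time"
)

func main() {
	const numServices = 500
	for _, perService := range []time.Duration{5 * time.Second, 10 * time.Second} {
		total := time.Duration(numServices) * perService
		fmt.Printf("%d Services x %v each = %v per full pass\n", numServices, perService, total)
	}
}
```

This prints 41m40s and 1h23m20s respectively, which matches the "approx. 1 hour" observation.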

I'm assuming Services are processed sequentially one-by-one due to the nature of Azure Load Balancers.

So, suggestions to improve Service/LoadBalancer reconciliation performance (either or both):

  1. Reconcile one Azure Load Balancer at a time instead of one Service at a time
  2. Make the cloud controller manager configurable to reconcile only Services matching a label selector (a rough sketch follows this list)
  • This would enable deploying multiple cloud controller managers, each dedicated to one Azure LB
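To make option 2 concrete, here is a rough, hypothetical sketch of filtering the Service informer with a label selector using standard client-go. The label key/value (example.com/lb-shard=blue) and the idea of wiring this into cloud-provider-azure are assumptions, not an existing option:

```go
// Hypothetical sketch: filter the Service informer by a label selector so
// that each controller instance only sees its own shard of Services.
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Only Services carrying this (hypothetical) label are delivered to the
	// event handlers, so this instance could be dedicated to one Azure LB.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 30*time.Second,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "example.com/lb-shard=blue"
		}),
	)

	svcInformer := factory.Core().V1().Services().Informer()
	svcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* enqueue for reconcile */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue for reconcile */ },
		DeleteFunc: func(obj interface{}) { /* enqueue for cleanup */ },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}
```

The actual service controller lives in upstream cloud-provider code, so a real implementation would need changes there as well; this only shows that the filtering itself is straightforward with client-go.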

Why is this needed:

  • Large clusters with Services of type=LoadBalancer won't scale without this

Duplicate issue in the AKS repo: Azure/AKS#4281

@desek can you open this very same issue also at https://github.com/Azure/AKS/issues

The AKS Product Group monitors that repo and might consider your issue for their roadmap

Thanks

As recently as February, @feiskyer stated this limit is still needed: #249 (comment) - I will ask for a re-evaluation. Thanks for the issue, @desek!

Thanks for the feedback. This couldn't be supported with the current LoadBalancer SKU, as many resources are shared, but it is planned for the container-native LoadBalancer (which is still WIP).

For the reconciling latency, have you tried the NodeIP-based SLB (e.g. setting loadBalancerBackendPoolConfigurationType to nodeIP in the cloud configuration file)? VM NIC operations are skipped in nodeIP mode, so provisioning is faster than in the default mode.
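For reference, a minimal sketch of that setting in the cloud configuration file (commonly /etc/kubernetes/azure.json; all other fields are omitted here, so treat this as an illustrative fragment rather than a complete configuration):

```json
{
  "cloud": "AzurePublicCloud",
  "loadBalancerSku": "standard",
  "loadBalancerBackendPoolConfigurationType": "nodeIP"
}
```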


Yes, we're using nodeIP. It's not fast enough for clusters running 500+ Services, since the bottleneck is that cloud-provider-azure processes Kubernetes Services sequentially.


I've opened the same issue in the AKS repo: Azure/AKS#4281

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten