istio/ztunnel

Optimize endpoint management for large services

howardjohn opened this issue · 1 comments

Currently, things are pretty slow when a Service has many endpoints. Each update (add or remove) uses a clone+modify approach, and we process updates one by one.

At scale, this approximates O(N^2) behavior. For example, to shrink 1 service from 50k endpoints to 1, we do clone(50k) + clone(49999) + ....
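To put a rough number on that sum, here is a small back-of-the-envelope sketch. It only counts how many endpoint entries get copied, assuming each removal clones the full current set first; the function names are made up for illustration and are not ztunnel code.

```rust
/// Endpoint entries copied when shrinking a service from `start` endpoints
/// down to `end`, one removal at a time, cloning the whole set before each
/// removal (the clone+modify approach described above).
fn clone_modify_copies(start: u64, end: u64) -> u64 {
    // The set being cloned has start, start-1, ..., end+1 entries.
    ((end + 1)..=start).sum()
}

/// Entries copied if the same removals were applied as one batch:
/// clone once, then remove in place on the private copy.
fn batched_copies(start: u64) -> u64 {
    start
}
```

With `start = 50_000` and `end = 1` the first function gives about 1.25 billion copied entries, versus 50,000 for the batched version. This ignores allocation and hashing costs, so it is only a lower bound on the real work.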

In some testing, this caused extreme performance degradation on the admin thread; the 50k removal was estimated to take ~40 minutes to complete.

A few approaches I can think of, possibly all of them:

  • Process XDS updates in batches, instead of one by one
    • Possibly even multiple XDS events at once, if we have a backlog?? Makes (N)ACK harder though.
  • Make our Service have interior mutability on the Endpoints. This way add/remove is just a simple hashmap add/remove (though it will probably not be just a simple hashmap, since we need some thread safety here)
  • Optimize how we handle the updates so we do less remove+add, when we don't need to remove?
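A rough sketch of what the first two ideas could look like, just to make the discussion concrete. Everything here is hypothetical: `Endpoint`, `Update`, `apply_batch`, and `SharedEndpoints` are stand-ins and not the actual ztunnel types, and a plain `RwLock<HashMap>` is only the simplest possible choice for the thread-safe map.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for a real endpoint type.
#[derive(Clone, Debug, PartialEq)]
struct Endpoint {
    address: String,
}

// Hypothetical update event, keyed by an endpoint UID.
enum Update {
    Add(String, Endpoint),
    Remove(String),
}

/// Idea 1: batching. Clone the endpoint map once per batch instead of
/// once per update, then publish the result as a new snapshot.
fn apply_batch(
    current: &Arc<HashMap<String, Endpoint>>,
    updates: impl IntoIterator<Item = Update>,
) -> Arc<HashMap<String, Endpoint>> {
    let mut next = (**current).clone(); // the only O(N) copy for the batch
    for update in updates {
        match update {
            Update::Add(uid, ep) => {
                next.insert(uid, ep);
            }
            Update::Remove(uid) => {
                next.remove(&uid);
            }
        }
    }
    Arc::new(next)
}

/// Idea 2: interior mutability. All clones share one map, so add/remove
/// is a plain map operation behind a lock with no copy of the whole set.
#[derive(Clone, Default)]
struct SharedEndpoints {
    inner: Arc<RwLock<HashMap<String, Endpoint>>>,
}

impl SharedEndpoints {
    fn insert(&self, uid: String, ep: Endpoint) {
        self.inner.write().unwrap().insert(uid, ep);
    }

    fn remove(&self, uid: &str) -> Option<Endpoint> {
        self.inner.write().unwrap().remove(uid)
    }

    fn len(&self) -> usize {
        self.inner.read().unwrap().len()
    }
}
```

The tradeoff is roughly this: batching keeps the immutable-snapshot model (readers never see a half-applied batch) but still pays one full clone per batch, while the shared map avoids the clone entirely at the cost of readers taking a lock and possibly observing intermediate states.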