Optimize endpoint management for large services
howardjohn opened this issue · 1 comments
howardjohn commented
Currently, things are pretty slow when there is a Service with many endpoints. Our update (add or remove) operations do a clone+modify approach, and we process updates one-by-one.
At scale, this approximates N^2 behavior. For example, to modify 1 service from 50k endpoints to 1, we do clone(50k) + clone(49999) + ...
In some testing, this can cause extreme performance degradation on the admin thread; the 50k removal was estimated to take ~40 minutes to complete.
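To make the cost concrete, here is a rough sketch of the arithmetic. It is not the real code: `Endpoints`, `remove_one_by_one`, and `remove_batched` are made-up names, and a plain `HashMap` stands in for whatever the Service actually holds. It just counts how many entries get copied when we clone before every removal versus cloning once for the whole set of removals.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a Service's endpoint set (not the real type).
type Endpoints = HashMap<u32, String>;

/// Remove `keys` one at a time, cloning the whole map before each removal.
/// Returns the final map and the total number of entries copied.
fn remove_one_by_one(mut current: Endpoints, keys: &[u32]) -> (Endpoints, u64) {
    let mut copied = 0u64;
    for k in keys {
        // clone+modify: every single removal pays for a full copy.
        let mut next = current.clone();
        copied += current.len() as u64;
        next.remove(k);
        current = next;
    }
    (current, copied)
}

/// Clone once, then apply every removal to that single copy.
fn remove_batched(current: &Endpoints, keys: &[u32]) -> (Endpoints, u64) {
    let mut next = current.clone();
    let copied = current.len() as u64;
    for k in keys {
        next.remove(k);
    }
    (next, copied)
}

fn main() {
    // Kept small so the sketch runs instantly; the issue is about ~50k.
    let n = 2_000u32;
    let eps: Endpoints = (0..n).map(|i| (i, format!("endpoint-{i}"))).collect();
    // Remove everything except endpoint 0.
    let keys: Vec<u32> = (1..n).collect();

    let (a, one_by_one) = remove_one_by_one(eps.clone(), &keys);
    let (b, batched) = remove_batched(&eps, &keys);
    assert_eq!(a, b);
    println!("one-by-one copied {one_by_one} entries, batched copied {batched}");
}
```

The one-by-one count grows like N^2/2, so at 50k endpoints it is on the order of a billion entry copies, while the batched count stays at N.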
howardjohn commented
A few main approaches I think, possibly all of them:
- Process XDS updates in batches, instead of one by one
  - Possibly even multiple XDS events at once, if we have a backlog? This makes (N)ACK harder, though.
- Make our Service have interior mutability on the Endpoints. This way add/remove is just a simple hashmap add/remove (though it will probably not be just a simple hashmap, since we need some thread safety here)
- Optimize how we handle the updates so we do less remove+add, when we don't need to remove?
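A minimal sketch of how the ideas above could fit together, under a lot of assumptions: `Service`, `Endpoint`, `EndpointUpdate`, and `apply_batch` here are hypothetical and do not match the real types, and an `Arc<RwLock<HashMap>>` is only one possible way to get thread-safe interior mutability. The point is just that a batch of adds and removes takes the write lock once and mutates the map in place, with no per-update clone.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical endpoint record; the real one carries far more data.
#[derive(Debug, Clone, PartialEq)]
struct Endpoint {
    address: String,
}

/// Hypothetical update event, keyed by an endpoint identifier.
enum EndpointUpdate {
    Add(String, Endpoint),
    Remove(String),
}

/// Hypothetical Service with interior mutability on its endpoints.
/// Cloning a Service only clones the Arc, not the endpoint map.
#[derive(Clone, Default)]
struct Service {
    endpoints: Arc<RwLock<HashMap<String, Endpoint>>>,
}

impl Service {
    /// Apply a whole batch of updates under a single write lock.
    fn apply_batch(&self, updates: impl IntoIterator<Item = EndpointUpdate>) {
        let mut eps = self.endpoints.write().unwrap();
        for update in updates {
            match update {
                // `insert` overwrites an existing key in place, so an
                // update to a known endpoint needs no separate remove.
                EndpointUpdate::Add(key, ep) => {
                    eps.insert(key, ep);
                }
                EndpointUpdate::Remove(key) => {
                    eps.remove(&key);
                }
            }
        }
    }

    fn endpoint_count(&self) -> usize {
        self.endpoints.read().unwrap().len()
    }
}

fn main() {
    let svc = Service::default();

    // Add 50k endpoints in one batch.
    svc.apply_batch((0..50_000).map(|i| {
        EndpointUpdate::Add(format!("ep-{i}"), Endpoint { address: format!("addr-{i}") })
    }));
    assert_eq!(svc.endpoint_count(), 50_000);

    // Scale down to a single endpoint in one batch.
    svc.apply_batch((1..50_000).map(|i| EndpointUpdate::Remove(format!("ep-{i}"))));
    assert_eq!(svc.endpoint_count(), 1);
}
```

The tradeoff is that readers now contend on the lock and no longer see an immutable snapshot, so a concurrent map or finer-grained locking may turn out to be a better fit than a single `RwLock`.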