bloomberg/goldpinger

Proposal: Allow limiting number of pods a single goldpinger "pings"

stuartnelson3 opened this issue · 9 comments

For larger clusters, it becomes prohibitively expensive, cardinality-wise, to scrape and store the metrics from every pod pinging every other pod: the number of timeseries grows as n^2, so, for example, 500 pods already produce 250,000 pod-pair series per metric.

We aren't necessarily interested in seeing that 95% of pods can't reach the 5% sitting in a rack with some sort of networking issue. We just want each pod to be contacted by a large enough subset of goldpingers that we can detect an issue, while keeping the number of Prometheus metrics exported at a reasonable size.

Would you consider a flag telling goldpinger to select X nodes (where X is configurable) and ping only those?

If you're interested, we can talk about potential implementation details.

On a side note: Thanks! We've been using goldpinger in smaller k8s clusters and it's detected issues, but we can't use it in any larger clusters, hence the feature request.

I think that's a good idea, and it broadens the range of use cases covered. Therefore: 👍

I'm wondering what would be the best way of choosing the subset of calls to make. We could probably just sort the IPs and take the N that come after our own in that order, as a simple algorithm that makes it easy to pick an N that covers all the nodes.
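Something like this, maybe (a quick hand-rolled sketch of that sorted-list idea, not actual goldpinger code; the names are made up):

```go
package main

import (
	"fmt"
	"sort"
)

// pickNext returns up to n peers that follow self in the sorted ring of IPs,
// wrapping around at the end and never returning self.
func pickNext(self string, ips []string, n int) []string {
	sorted := append([]string(nil), ips...)
	sort.Strings(sorted)

	// Our own position in the sorted ring; peers start right after it.
	start := sort.SearchStrings(sorted, self)

	picked := make([]string, 0, n)
	for off := 1; off < len(sorted) && len(picked) < n; off++ {
		picked = append(picked, sorted[(start+off)%len(sorted)])
	}
	return picked
}

func main() {
	ips := []string{"10.0.0.3", "10.0.0.1", "10.0.0.2", "10.0.0.4"}
	fmt.Println(pickNext("10.0.0.2", ips, 2)) // [10.0.0.3 10.0.0.4]
}
```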

What did you have in mind?

> On a side note: Thanks! We've been using goldpinger in smaller k8s clusters and it's detected issues, but we can't use it in any larger clusters, hence the feature request.

Awesome, glad you're finding it useful.

I was thinking of taking a look to see if there's a good pre-existing rendezvous hashing library, primarily to ensure randomness of selection.

https://en.wikipedia.org/wiki/Rendezvous_hashing

EDIT:
https://github.com/dgryski/go-rendezvous is probably good
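To illustrate the technique without guessing at that library's API, here's a hand-rolled sketch of rendezvous hashing that picks N targets per client (again, made-up names, not goldpinger code):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// score hashes the (client, node) pair; each client ranks every node by this
// value, so selection looks random but is deterministic and needs no state.
func score(client, node string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(client))
	h.Write([]byte{0}) // separator so "ab"+"c" hashes differently from "a"+"bc"
	h.Write([]byte(node))
	return h.Sum64()
}

// topN returns the n highest-scoring nodes for this client.
func topN(client string, nodes []string, n int) []string {
	ranked := append([]string(nil), nodes...)
	sort.Slice(ranked, func(i, j int) bool {
		return score(client, ranked[i]) > score(client, ranked[j])
	})
	if n > len(ranked) {
		n = len(ranked)
	}
	return ranked[:n]
}

func main() {
	nodes := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"}
	fmt.Println(topN("10.0.0.1", nodes, 2))
}
```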

So I guess we are talking about two use cases here:

  1. having all the clients independently agree on the same subset of nodes to ping, which is good for verifying that N nodes are reachable from everywhere else, and which is what rendezvous hashing would allow,

  2. having each node ping a predictable subset of N nodes, in a way that lets N be picked so that every node gets pinged by some other node, which is good for verifying that there are no entirely unreachable nodes.

I think that they both have merit, and given the simplicity of the code required, we could probably implement two new flags:

- `--ping-number` to set the N to a value
- `--ping-algorithm` that can take either `rendezvous` (default) or `uniform`
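Roughly this kind of surface (a minimal sketch using Go's standard flag package; both the defaults and the use of the flag package are assumptions here, not goldpinger's actual config handling):

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flag names come from the proposal above; defaults are assumptions.
	pingNumber := flag.Int("ping-number", 0, "how many nodes each instance pings (assumed: 0 = all)")
	pingAlgorithm := flag.String("ping-algorithm", "rendezvous", "node selection: rendezvous or uniform")
	flag.Parse()

	fmt.Printf("pinging %d node(s) selected via %q\n", *pingNumber, *pingAlgorithm)
}
```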

What do you reckon? Would you like to contribute to the project?

Sounds good to me!

A third point worth adding: as new nodes are added or old ones removed, a rendezvous hash should minimize the disruption in assigning nodes to clients. If we were to simply assign each client the next N nodes in the list retrieved from the kube API, a change in the list could result in pinging a completely new set of nodes, which would create a completely new set of timeseries for each client.
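To make that concrete, here's a small self-contained check (reusing the hand-rolled scoring from the sketch above, not the library) showing that removing one node only changes targets for clients that were actually pinging it:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Same hand-rolled scoring as in the earlier sketch.
func score(client, node string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(client))
	h.Write([]byte{0})
	h.Write([]byte(node))
	return h.Sum64()
}

// topN returns the client's n highest-scoring nodes as a set.
func topN(client string, nodes []string, n int) map[string]bool {
	ranked := append([]string(nil), nodes...)
	sort.Slice(ranked, func(i, j int) bool {
		return score(client, ranked[i]) > score(client, ranked[j])
	})
	picked := make(map[string]bool, n)
	for _, node := range ranked[:n] {
		picked[node] = true
	}
	return picked
}

func main() {
	var nodes []string
	for i := 1; i <= 20; i++ {
		nodes = append(nodes, fmt.Sprintf("10.0.0.%d", i))
	}
	removed := nodes[7]
	remaining := append(append([]string(nil), nodes[:7]...), nodes[8:]...)

	// A client's targets should only change if one of them WAS the removed
	// node; every other (client, target) pair - and its timeseries - survives.
	for _, client := range nodes {
		before, after := topN(client, nodes, 3), topN(client, remaining, 3)
		for node := range after {
			if !before[node] && !before[removed] {
				fmt.Println("unexpected churn for", client) // never prints
			}
		}
	}
	fmt.Println("only clients that were pinging", removed, "picked a new target")
}
```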

I made a relatively simple implementation to start, but when checking the lib's behavior for removing a node, I found that the code panics. I'm hoping to get some time tomorrow to fix it, and hopefully open a PR then.

Regarding `--ping-algorithm=uniform`: what does uniform imply? That all nodes are being pinged? If so, it could be implied by `--ping-number=0`.

> Regarding `--ping-algorithm=uniform`: what does uniform imply? That all nodes are being pinged? If so, it could be implied by `--ping-number=0`.

I mean option 2) from my previous comment, so basically taking the next N IPs from the sorted list after our own IP. Perhaps uniform is a terrible name 😄

Playing devil's advocate (and as someone who maintains a project that has way too many knobs to configure), is there a reason someone would care about the distribution? I'm wondering if it's necessary.

Fair enough - let's address that when/if a need arises.

This has now been resolved via #53

It's available in version 1.5.0. Thanks @stuartnelson3 !