Proposal: Allow limiting number of pods a single goldpinger "pings"
stuartnelson3 opened this issue · 9 comments
For larger clusters, it becomes prohibitively expensive cardinality-wise to scrape and store the metrics from every pod connecting to every other pod (n^2 timeseries growth).
We aren't necessarily interested in seeing that 95% of pods can't reach the 5% that sit in a rack with some sort of networking issue; we just want each pod to be contacted by a large enough subset of goldpingers that we can detect an issue, while keeping the number of exported Prometheus metrics at a reasonable size.
Would you consider a flag that tells goldpinger to select X nodes (where X is configurable) and ping only those?
If you're interested, we can talk about potential implementation details.
On a side note: Thanks! We've been using goldpinger in smaller k8s clusters and it's detected issues, but we can't use it in any larger clusters, hence the feature request.
I think that's a good idea, and it widens the number of use cases covered. Therefore: 👍
I'm wondering what would be the best way of choosing the subset of calls to make. We could probably just sort the IPs and take the N that come after our own in that order, as a simple algorithm that makes it easy to pick an N that covers all the nodes.
What did you have in mind?
On a side note: Thanks! We've been using goldpinger in smaller k8s clusters and it's detected issues, but we can't use it in any larger clusters, hence the feature request.
Awesome, glad you're finding it useful.
I was thinking of taking a look to see if there's a good pre-existing rendezvous hashing library, primarily to ensure randomness of selection.
https://en.wikipedia.org/wiki/Rendezvous_hashing
EDIT:
https://github.com/dgryski/go-rendezvous is probably good
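For illustration, here is a minimal, self-contained sketch of how rendezvous hashing could pick N ping targets per pod. This is not goldpinger's code and not the go-rendezvous API; the function names and the FNV-based scoring are assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// score hashes the (self, peer) pair; each pod ranks peers by this value.
func score(self, peer string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(self))
	h.Write([]byte{0}) // separator so "a"+"bc" != "ab"+"c"
	h.Write([]byte(peer))
	return h.Sum64()
}

// selectPeers returns the n peers with the highest score for this pod.
func selectPeers(self string, peers []string, n int) []string {
	others := make([]string, 0, len(peers))
	for _, p := range peers {
		if p != self {
			others = append(others, p)
		}
	}
	sort.Slice(others, func(i, j int) bool {
		return score(self, others[i]) > score(self, others[j])
	})
	if n > len(others) {
		n = len(others)
	}
	return others[:n]
}

func main() {
	peers := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"}
	fmt.Println(selectPeers("10.0.0.1", peers, 2))
}
```

The useful property is that every pod computes the same deterministic ranking locally, so removing one peer only disturbs the pods that had that peer in their top N.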
So I guess we are talking about two use cases here:
- having all the clients independently agree on the same subset of nodes to ping, which is good for verifying that N nodes are reachable from everywhere else, and which rendezvous hashing would allow;
- having each node ping a predictable subset of N nodes, in a way that allows N to be picked so that every node gets pinged by some other nodes, which is good for verifying that there are no entirely unreachable nodes.
I think that they both have merit, and given the simplicity of the code required, we could probably implement two new flags:
- --ping-number to set the N to a value
- --ping-algorithm that can take either rendezvous (default) or uniform
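As a rough sketch of the uniform option (an assumption for illustration, not the implementation that eventually shipped), the selection could look something like this:

```go
package main

import (
	"fmt"
	"sort"
)

// nextN returns the n IPs that follow self in sorted order, wrapping around,
// so with n >= 1 every pod ends up being pinged by at least one other pod.
// Note: the order is lexicographic, not numeric; consistency is all that matters here.
func nextN(self string, ips []string, n int) []string {
	sorted := append([]string(nil), ips...)
	sort.Strings(sorted)

	start := sort.SearchStrings(sorted, self) // our own position in the list

	out := make([]string, 0, n)
	for i := 1; i <= n && i < len(sorted); i++ {
		out = append(out, sorted[(start+i)%len(sorted)])
	}
	return out
}

func main() {
	ips := []string{"10.0.0.3", "10.0.0.1", "10.0.0.5", "10.0.0.2", "10.0.0.4"}
	fmt.Println(nextN("10.0.0.4", ips, 2)) // [10.0.0.5 10.0.0.1]
}
```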
What do you reckon? Would you like to contribute to the project?
Sounds good to me!
A third point that could be added is that as new nodes are added or old ones removed, a rendezvous hash should minimize the disruption in assigning nodes to clients. If we were to do a simple "assign the next N nodes in the list retrieved from the kubeapi to the client", a change in the list would result in pinging completely new nodes, which would cause a completely new set of timeseries to be created for each client.
I made a relatively simple implementation to start, but when checking the lib's behavior for removing a node, I found that the code panics. I'm hoping to get some time tomorrow to fix it, and hopefully open a PR then.
Regarding --ping-algorithm=uniform, what does uniform imply? That all nodes are being pinged? If so, it could be implied by --ping-number=0.
I mean the option 2) from my previous comment - so basically taking the N next IPs from the sorted list after our own IP. Perhaps uniform is a terrible name 😄
Playing devil's advocate (and as someone who maintains a project that has way too many knobs to configure), is there a reason someone would care about the distribution? I'm wondering if it's necessary.
Fair enough - let's address that when/if a need arises.
This has now been resolved via #53. It's available in version 1.5.0. Thanks @stuartnelson3!