Long Ping Times
jqport opened this issue · 9 comments
Hi, this looks to be a really neat project and I love the name.
When I installed it via the Helm chart, the default Grafana dashboard showed very long "ping" times: around 2 seconds for some larger clusters, and even a ten-node cluster is showing ~1 second ping times.
I set up another service to run requests against the /ping endpoint that Goldpinger exposes and saw the expected low ping times (a few milliseconds).
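(For reference, a single direct call can be timed with something like the sketch below; the port 8080 and the POD_IP placeholder are assumptions based on a default Goldpinger setup.)

```sh
# Time one direct request to a goldpinger pod's /ping endpoint.
# POD_IP is a placeholder; 8080 is assumed to be the port goldpinger listens on.
curl -s -o /dev/null -w 'total: %{time_total}s\n' "http://POD_IP:8080/ping"
```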
Have you noticed this before with any of your clusters? Any tips on configuration gotchas that I may have missed?
Are we talking about the individual ping times or the time to generate the full graph?
The Connections to Peers/Connections to Kubernetes API graphs. In a 250-node cluster the 99th percentile shows around 2 seconds, the 95th around 1.1 seconds, and the 50th around 500 ms.
I made a new build and instrumented the calls a bit, and the reported times seem accurate. But when I hit the /ping endpoints directly from pods with other tools, like a timed curl, the times look much more reasonable (~a few milliseconds).
It does sound slower than it should be. I'm wondering whether your pods ended up synchronising their probes - there is unfortunately no jitter in the updater. How long does a call to /check_all on any instance take?
The /check_all calls are taking about 2.657 seconds. What do you mean by pod synchronization?
All right - I guess we'll need to dig a little deeper to see what's going on.
Is there anything noteworthy about your networking stack? Is there any other traffic?
It looks like we're hitting a bottleneck somewhere. One way of pinning it down would be to transform the daemonset into a deployment with anti-affinity and decrease the number of replicas until we start seeing reasonable numbers. When we get there, it should be easier to see which limit we're hitting.
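A rough sketch of that bisection step, assuming the workload has already been converted into a Deployment named goldpinger (the name and the replica steps here are placeholders):

```sh
# Step the replica count down and re-measure at each step to find where
# latency becomes reasonable. Deployment name and step values are assumptions.
for n in 250 100 50 25 10; do
  kubectl scale deployment goldpinger --replicas="$n"
  kubectl rollout status deployment goldpinger
  # ...re-check /check_all timings and the Grafana percentiles before continuing...
done
```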
Also, it might be useful to simulate 250 concurrent connections to a replica to see how well it fares when all neighbors are calling.
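A minimal way to approximate that from a shell, again assuming a known pod IP and the default port 8080:

```sh
# Fire 250 concurrent timed requests at a single replica's /ping endpoint
# to mimic all peers checking it at once. POD_IP and the port are placeholders.
for i in $(seq 1 250); do
  curl -s -o /dev/null -w '%{time_total}\n' "http://POD_IP:8080/ping" &
done
wait
```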
The p50 at 500 ms for /ping is worrisome, unless the metric/graph is wrong and is actually showing /check times?
Hi,
the problem with a large number of nodes (~250) was the default CPU requests/limits. After I increased the CPU (from 100m), the graphs started to show much better results.
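(A minimal sketch of one way to apply that bump to a running DaemonSet; the daemonset/container names and the CPU values below are assumptions and will vary per setup.)

```sh
# Raise the CPU request/limit on the goldpinger DaemonSet in place.
# Names and values are assumptions; adjust to your own chart/manifests.
kubectl set resources daemonset goldpinger \
  --containers=goldpinger \
  --requests=cpu=500m \
  --limits=cpu=1
```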
That makes a lot of sense. We should probably at least add a comment about that to the example yaml.
Closing now. Feel free to re-open if you find anything that needs fixing.