Long Ping Times
jqport opened this issue · 9 comments
Hi, this looks to be a really neat project and I love the name.
When I installed it via the Helm chart, the default Grafana dashboard showed very long "ping" times: around 2 seconds for some larger clusters, and even a ten-node cluster is showing ~1 second ping times.
I set up another service to run requests against the /ping endpoint that Goldpinger exposes and saw the expected low ping times (a few milliseconds).
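(For reference, a single direct call can be timed with something like the sketch below; the port 8080 and the POD_IP placeholder are assumptions based on a default Goldpinger setup.)

```sh
# Time one direct request to a goldpinger pod's /ping endpoint.
# POD_IP is a placeholder; 8080 is assumed to be the port goldpinger listens on.
curl -s -o /dev/null -w 'total: %{time_total}s\n' "http://POD_IP:8080/ping"
```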
Have you noticed this before with any of your clusters? Any tips on configuration gotchas that I may have missed?
Are we talking about the individual ping times or the time to generate the full graph?
The Connections to Peers/Connections to Kubernetes API graphs. In a 250-node cluster the 99th percentile shows around 2 seconds, the 95th around 1.1 seconds, and the 50th around 500 ms.
I made a new build and instrumented the calls a bit, and the reported times seem accurate. But when I hit the /ping endpoints directly from pods with other tools, like a timed curl, the times look much more reasonable (~a few milliseconds).
It does sound slower than it should be. I'm wondering whether your pods ended up synchronising their probes - there is unfortunately no jitter in the updater. How long does a call to /check_all on any instance take?
The /check_all calls are taking about 2.657 seconds. What do you mean by pod synchronization?
All right - I guess we'll need to dig a little deeper to see what's going on.
Is there anything noteworthy about your networking stack? Is there any other traffic?
It looks like we're hitting a bottleneck somewhere. One way of pinning it down would be to transform the daemonset into a deployment with anti-affinity and decrease the number of replicas until we start seeing reasonable numbers. When we get there, it should be easier to see which limit we're hitting.
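A rough sketch of that bisection step, assuming the workload has already been converted into a Deployment named goldpinger (the name and the replica steps here are placeholders):

```sh
# Step the replica count down and re-measure at each step to find where
# latency becomes reasonable. Deployment name and step values are assumptions.
for n in 250 100 50 25 10; do
  kubectl scale deployment goldpinger --replicas="$n"
  kubectl rollout status deployment goldpinger
  # ...re-check /check_all timings and the Grafana percentiles before continuing...
done
```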
Also, it might be useful to simulate 250 concurrent connections to a replica to see how well it fares when all neighbors are calling.
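A minimal way to approximate that from a shell, again assuming a known pod IP and the default port 8080:

```sh
# Fire 250 concurrent timed requests at a single replica's /ping endpoint
# to mimic all peers checking it at once. POD_IP and the port are placeholders.
for i in $(seq 1 250); do
  curl -s -o /dev/null -w '%{time_total}\n' "http://POD_IP:8080/ping" &
done
wait
```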
The p50 at 500 ms for /ping is worrisome, unless the metric/graph is wrong and is actually showing /check times?
Hi,
the problem with a large number of nodes (~250) was the default CPU requests/limits. After I increased the CPU (from 100m), the graphs started to show much better results.
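(A minimal sketch of one way to apply that bump to a running DaemonSet; the daemonset/container names and the CPU values below are assumptions and will vary per setup.)

```sh
# Raise the CPU request/limit on the goldpinger DaemonSet in place.
# Names and values are assumptions; adjust to your own chart/manifests.
kubectl set resources daemonset goldpinger \
  --containers=goldpinger \
  --requests=cpu=500m \
  --limits=cpu=1
```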
That makes a lot of sense. We should probably at least add a comment about that to the example yaml.
Closing now. Feel free to re-open if you find anything that needs fixing.