Fix istio request rate discrepancy
vishalbollu commented
Description
The request rate per second shown on an API's dashboard in Grafana doesn't reflect reality. A couple of observations show that the istio_requests_total metric doesn't make sense:
- It looks like the baseline value (without factoring in the ever-increasing component) is 60 times larger than the real value.
- The reported requests/s keeps increasing over time even though the cluster is receiving the same steady-state requests/s.
- When the requests stop flowing in, the graph does not drop to zero; it stagnates at that high value.
- When the ingress gateway scales down and then back up, the metrics from the different ingress pods get mixed together, making it look like there is a single metric when there are actually n ingresses. This creates mayhem in what the dashboard reports.
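One hypothetical way to observe the mixed-up series is to list the raw istio_requests_total series straight from Prometheus; the port-forward target (svc/prometheus) and the hello-world service name below are assumptions, so adjust them to your cluster:

```bash
# Hypothetical check: list the raw istio_requests_total series for the API to see
# how the series from the different ingress pods are labeled after relabeling.
# Assumes Prometheus is reachable via a port-forward on localhost:9090 and that the
# API is named hello-world (adjust both to your cluster).
kubectl -n prometheus port-forward svc/prometheus 9090:9090 &

curl -sG 'http://localhost:9090/api/v1/series' \
  --data-urlencode 'match[]=istio_requests_total{destination_service_name="hello-world"}' \
  | jq '.data'
```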
Reproducibility
Run a hello world realtime API with some replicas (e.g. 5) on a cortex cluster and hit it with the following load test:
hey -n 100000000000 -c 20 -m POST -q 1000 $ENDPOINT/hello-world
Let it run for a couple of minutes and then cancel the load test. Compare the number of requests as shown in Grafana with those reported by hey. There will be a big discrepancy.
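For that comparison, a rough way to read the total out of Prometheus itself (same assumed port-forward as above; the 10m window and the hello-world name are placeholders) is:

```bash
# Hypothetical comparison: total requests Prometheus attributes to the API over the
# last 10 minutes, to set against the per-status-code totals printed in hey's summary.
# Assumes the port-forward from the previous snippet is still running.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(increase(istio_requests_total{destination_service_name="hello-world"}[10m]))' \
  | jq '.data.result'
```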
How to "fix" this
Editing the istio-stats PodMonitor in the prometheus namespace and removing the metricRelabelings section gets it to work correctly. Obviously, that section is still very much needed, so this calls for a proper fix.
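A minimal sketch of that workaround, assuming the prometheus-operator CRDs are installed and the PodMonitor lives in the prometheus namespace as described above:

```bash
# Workaround sketch (not a proper fix): drop the metricRelabelings from the istio-stats PodMonitor.
kubectl -n prometheus edit podmonitor istio-stats
# In the editor, delete the metricRelabelings block under spec.podMetricsEndpoints and save;
# the Prometheus operator will regenerate the scrape config without those relabeling rules.
```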