cortexlabs/cortex

Fix istio request rate discrepancy


Description

The request rate per second shown on an API's dashboard in Grafana doesn't reflect reality. A few observations suggest that the istio_requests_total metric doesn't add up:

  1. It looks like the baseline value (before factoring in the ever-increasing drift) is 60 times larger than the real request rate.
  2. The reported requests/s keeps increasing over time even though the cluster is only receiving the same steady-state requests/s.
  3. When requests stop flowing in, the graph does not drop to zero; it plateaus at that inflated value.
  4. When the ingress gateway scales down and then back up, metrics from different ingress pods get mixed together, making it look like there is a single series when there are actually n ingresses (see the series check sketched right after this list). This wreaks havoc on what the dashboard reports.
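One way to sanity-check observation 4 (an assumption of mine, not something verified in this issue): list the label sets behind istio_requests_total and see whether anything still distinguishes the individual ingress gateway pods. If the relabeling strips the pod/instance labels, counters from all ingress replicas collapse into one non-monotonic series, which rate()-style dashboard queries then misread as repeated counter resets. A minimal sketch, assuming Prometheus runs in the prometheus namespace and the service/port names below (they are illustrative):

# Port-forward Prometheus so its HTTP API is reachable locally
# (service and port names are assumptions; adjust to your cluster).
kubectl -n prometheus port-forward svc/prometheus 9090:9090 &

# Dump the label sets of istio_requests_total; if no pod/instance label
# separates the ingress gateway replicas, their counters are being merged.
curl -sG 'http://localhost:9090/api/v1/series' \
  --data-urlencode 'match[]=istio_requests_total'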

Reproducibility

Run a hello world realtime API with a few replicas (e.g. 5) on a Cortex cluster and hit it with the following load test:

hey -n 100000000000 -c 20 -m POST -q 1000 $ENDPOINT/hello-world

Let it run for a couple of minutes and then cancel the load test. Compare the number of requests shown in Grafana with the count reported by hey. There will be a big discrepancy.
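As a rough cross-check (a sketch, not part of the original report), the same window can be queried straight from Prometheus and compared against the request count hey prints in its summary. The destination_service_name value is an assumption based on the example endpoint, and the [10m] range should match how long the load test actually ran:

# Assumes the same Prometheus port-forward as in the Description section.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(increase(istio_requests_total{destination_service_name="hello-world"}[10m]))'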

How to "fix" this

Editing the istio-stats PodMonitor in the prometheus namespace and removing the metricRelabelings section makes the dashboard behave correctly. Obviously, that section is still very much needed, so a proper fix is required.
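For reference, the workaround amounts to something like this; the PodMonitor name and namespace come from the text above, and nothing else in the resource needs to change:

# Open the PodMonitor and delete the entire metricRelabelings block under
# podMetricsEndpoints; the Prometheus operator reloads the scrape config
# automatically after saving.
kubectl -n prometheus edit podmonitor istio-stats

This trades away whatever the relabeling was doing (label cleanup, cardinality control) for accurate numbers, which is why it is only a stopgap and not the actual fix.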