grafana/cortex-jsonnet

dealing with mostly broken Grafana dashboards

danfromtitan opened this issue · 6 comments

I built the dashboards following the mixtool instructions, uploaded them to Grafana, and found myself lost in a sea of "No data". Most dashboards fall into that category (reads, writes, compactor resources, queries, etc.); only the object store dashboard has about half of its graphs working.

Digging into one of the queries tells me the problem is in how the queries are written:

sum(cluster_namespace_job:cortex_distributor_received_samples:rate5m{cluster=~"$cluster", job=~"($namespace)/(distributor|cortex$)"})

The same metric comes out of Prometheus with different labels, so there is no way the query would work out of the box:

cluster_namespace_job:cortex_distributor_received_samples:rate5m{job="cortex-distributor", namespace="cortex"}

More or less the same is true for all the other "No data" charts; it looks like each one needs to be changed to make it work.

I installed Cortex from the latest version of your Helm chart (0.6.0 at the time), and it does deploy the latest Cortex image, so the metric labels should be up to date.

Before I take on the monumental task of fixing these charts, I wanted to ask: is there something I missed? My expectation was that, having deployed the latest Cortex from the latest Helm chart, most if not all of the provided dashboards would come with queries matching the exposed metrics.

I would configure your Prometheus jobs to add the following labels:

  • cluster: name of the K8S cluster
  • job: "<namespace>/<deployment/statefulset>" (e.g. if Cortex is deployed in the "cortex-01" namespace, then ingesters would have the job label "cortex-01/ingester")

Could you be more precise about the job part?

Currently my Prometheus sends an external_label for cluster, and jobs are automatically named after the deployment, like cortex-compactor.

The job label is one that we expect Prometheus to add (through the Prometheus scrape config). Its value is <namespace>/<deployment|statefulset|daemonset>, where <namespace> is the namespace the pod runs in and <deployment|statefulset|daemonset> is the name of the pod's deployment, statefulset, or daemonset.
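As a rough illustration of the label format described above, the sketch below mimics in Python what a Prometheus `replace` relabel step does when it joins source labels with a separator and writes the result to a target label. The input label names `namespace` and `workload` are hypothetical; in a real scrape config the sources would more likely be service-discovery labels, and the exact rules the mixin expects are not stated in this thread.

```python
def relabel_job(labels, source_labels=("namespace", "workload"),
                separator="/", target_label="job"):
    """Mimic a Prometheus `replace` relabel step that uses the default
    regex and replacement: join the source label values with `separator`
    and store the result in `target_label`.

    `namespace` and `workload` are hypothetical label names chosen for
    this example only.
    """
    joined = separator.join(labels.get(name, "") for name in source_labels)
    relabeled = dict(labels)  # leave the caller's dict untouched
    relabeled[target_label] = joined
    return relabeled


# Labels roughly as they might arrive with a deployment-derived job name.
scraped = {"namespace": "cortex-01", "workload": "ingester", "job": "cortex-ingester"}
print(relabel_job(scraped)["job"])
```

With the input above, the job label becomes "cortex-01/ingester", which is the shape the dashboards' job matchers look for.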

There is a job label in the metrics; its value is just not in the format Grafana expects:

Prometheus records the label as: job="cortex-distributor"
Grafana regex looks to match the label as: job=~"($namespace)/(distributor|cortex$)"

There is no way these two would match. I'm still hanging on to the thought that these label values should match between what Prometheus collects and what Grafana expects, and I'm not sure how they came to be so far apart.
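The mismatch can be checked outside Grafana with a small sketch. It assumes the `$namespace` dashboard variable resolves to `cortex`, and it uses Python's `re.fullmatch` as a stand-in for Prometheus's fully anchored regex matchers (Prometheus uses RE2, so this is only an approximation). The helper name `job_matches` is made up for the example.

```python
import re


def job_matches(namespace, job):
    """Return True if `job` would satisfy the dashboard's job matcher.

    The pattern is the one from the dashboard query with $namespace
    substituted. Prometheus anchors regex matchers at both ends, which
    re.fullmatch approximates here.
    """
    pattern = rf"({namespace})/(distributor|cortex$)"
    return re.fullmatch(pattern, job) is not None


# Label value as scraped from the Helm chart deployment.
print(job_matches("cortex", "cortex-distributor"))
# Label value in the <namespace>/<name> format the dashboards expect.
print(job_matches("cortex", "cortex/distributor"))
```

The first call prints False and the second prints True: the pattern requires a "/" after the namespace, which the deployment-derived job name never contains.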

The job label value is just one example; other labels in the Grafana queries would also prevent results from showing in the dashboards.

Anyway, the way I intend to deal with this is to fork cortex-mixin and fix the queries to match the Prometheus labels as much as possible, then adjust the dashboards manually for the rest.

Are there relabeling rules documented somewhere? I'm trying to take the guesswork out of this effort. I deployed Cortex from your Helm chart, and the metrics come straight out of the ServiceMonitor without any relabeling. I've added the recording rules from cortex-mixin alongside that, but I didn't come across any scrape config requirements in my reading.

To follow up on this issue, I opened cortexproject/cortex-helm-chart#233 to ensure the container names produced by the Cortex Helm chart align with the container label values in cortex-mixin.
Other than that, I had to significantly modify cortex-mixin to get the dashboards to work after deploying Cortex from the official Helm chart. I'm waiting for a small change in the Helm chart before I can open a PR with this change.