m-lab/mlab-ns

prom queries used by mlab-ns miss lame duck values

Closed this issue · 4 comments

During the Japan incident we discovered that HND01 could not be placed in lame duck mode because of a subtle bug in the prom query used.

The bug is the compound use of OR which requires that every metric have a distinct set of labels. When two or more metrics have the same label set then prometheus will select one (evidently the first) and discard all others.

In our query, the lame duck check comes last, and was being ignored.

As evidence:

The abbreviated query using the metrics with matching label sets (since both come from node exporter):

min by (experiment, machine) (
      (vdlimit_used{experiment="ndt.iupui", machine=~".*hnd.*"} /
          vdlimit_total{experiment="ndt.iupui", machine=~".*hnd.*"}) < bool 0.95 OR 
   lame_duck_experiment{experiment="ndt.iupui", machine=~".*hnd01.*"} != bool 1)

Returns:

{experiment="ndt.iupui",machine="mlab1.hnd01.measurement-lab.org"} | 1
{experiment="ndt.iupui",machine="mlab2.hnd01.measurement-lab.org"} | 1
{experiment="ndt.iupui",machine="mlab2.hnd02.measurement-lab.org"} | 1
{experiment="ndt.iupui",machine="mlab1.hnd02.measurement-lab.org"} | 1
{experiment="ndt.iupui",machine="mlab3.hnd02.measurement-lab.org"} | 1
{experiment="ndt.iupui",machine="mlab3.hnd01.measurement-lab.org"} | 1

Whereas when we explicitly relabel the lame duck metric to have differing label sets:

min by (experiment, machine) (
      (vdlimit_used{experiment="ndt.iupui", machine=~".*hnd.*"} /
          vdlimit_total{experiment="ndt.iupui", machine=~".*hnd.*"}) < bool 0.95 OR
      label_replace(lame_duck_experiment{experiment="ndt.iupui", machine=~".*hnd01.*"}, "lameduck", "foobar", "", "")   != bool 1)

Returns the correct values for hnd01:

{experiment="ndt.iupui",machine="mlab2.hnd02.measurement-lab.org"} | 1
{experiment="ndt.iupui",machine="mlab3.hnd01.measurement-lab.org"} | 0
{experiment="ndt.iupui",machine="mlab2.hnd01.measurement-lab.org"} | 0
{experiment="ndt.iupui",machine="mlab3.hnd02.measurement-lab.org"} | 1
{experiment="ndt.iupui",machine="mlab1.hnd01.measurement-lab.org"} | 0
{experiment="ndt.iupui",machine="mlab1.hnd02.measurement-lab.org"} | 1

To avoid the idiosyncratic promql and the OR operation, we should construct a more explicit compound query using AND.

@stephen-soltesz: Do you have any initial ideas of how to construct the query not using the OR operator? I wasn't able to figure out how to do this at the time I was adding Prometheus support to mlab-ns. The basic issue I ran up against, and didn't figure out how to solve with AND, being that AND only returns timeseries from vector1 where the label set in vector1 and vector2 match and this isn't always the case, so the AND results excludes some timeseries that we need.

@nkinkade I believe it's possible to use a construct like ... AND ON(experiment, machine) ... -- this selects only the labels we care about matching on.

@stephen-soltesz: Maybe we can talk through this on a VC? In addition to the intersection issue, the AND operator only returns timeseries from vector1, so if vector1's value is 1, while, say, vector2 has a value of 0, the timeseries returned is 1, when we want it to be 0, since any 0 in any timeseries should yield a zero in the query.

During discussion we also discovered that the original query does not handle missing metrics correctly. That is, if node exporter is down such that the vdlimit metrics do not exist, then the original query defaults to the status of only the probe_success and script_success metrics. In cases where metrics are missing, the tool status should be zero.