blue-yonder/azure-cost-mon

Constant costs might decrease due to numerical instabilities

StephanErb opened this issue · 6 comments

Exported data contains long floating-point numbers such as 0.0000032120027740023963. Due to the nature of floating-point numbers, aggregating those can lead to unstable results, as addition is not associative. This is problematic as it prevents us from defining meaningful aggregation rules over exported cost data.

Please see prometheus/prometheus#2951 for details.

We don't need full precision here, so we should round the results to 2 or 3 digits before emitting.
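A quick illustration of the instability (the values are made up for demonstration; only the first one is from the exported data above). Because float addition is not associative, the summation order changes the result in the last bits, while rounding before emitting makes it stable:

```python
# Illustrative only: summing the same values in a different order
# yields a (slightly) different result.
values = [0.0000032120027740023963, 1e8, -1e8]

left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right)   # the tiny value is partially absorbed by 1e8
print(right_to_left)   # the tiny value survives exactly

# Rounding to a few digits before emitting hides the instability:
print(round(left_to_right, 3) == round(right_to_left, 3))  # True
```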

Adopted workaround:

# FIXME: round is a workaround for https://github.com/blue-yonder/azure-cost-mon/issues/12
job:azure_costs_eur:sum =
    round(sum(azure_costs_eur))

The problem is even worse: due to these instabilities, a counter can decrease even though it should be constant. Prometheus interprets this as a counter reset, which completely invalidates the results of the increase or rate functions.
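To see why a tiny dip is so damaging, here is a sketch (not the actual Prometheus source) of how counter-reset detection reacts to it. Any decrease is taken to be a reset from zero, so the full post-dip value is added to the computed increase:

```python
# Sketch of Prometheus-style reset handling for increase():
# a decrease is interpreted as a counter reset from 0.
def naive_increase(samples):
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev   # normal monotonic step
        else:
            total += cur          # "reset": count the full new value
    return total

steady = [1000.0, 1000.0, 1000.0]
noisy  = [1000.0, 999.9999999, 1000.0]  # tiny dip from the API

print(naive_increase(steady))  # 0.0 -- what we expect for constant costs
print(naive_increase(noisy))   # roughly 1000 -- the dip doubles the series
```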

Released v0.4.1 to fix the issue.

Going to integers reduced the probability of the problem occurring, but we might still see flaps and counter resets. Essentially, these are items that are constant in reality (resources that have been decommissioned) but whose exported value flaps around an integer boundary.
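A hypothetical example of the boundary flapping: a constant cost whose raw API values straddle an integer boundary still produces a flapping integer series after rounding, so the counter-reset problem survives:

```python
# Hypothetical API samples for a decommissioned (i.e. constant-cost)
# resource, jittering around the 41.5 boundary:
api_samples = [41.4999999, 41.5000001, 41.4999998]

rounded = [round(v) for v in api_samples]
print(rounded)  # [41, 42, 41] -- the integer series still flaps
```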

We are using the following aggregation rules as a workaround right now:

#
# Azure
#
# We ignore the Prometheus rule format here by not specifying the labels in the series before
# the first colon. We simply don't really know which ones are there. In any case, we still need
# the recording rule as increase/changes over 2 days are costly to compute for plots.
#

# Cost increase over the last 2 days
# We cannot use the normal increase function here as the Azure API is providing slightly
# fluctuating costs. Those would be interpreted as counter resets, leading to wrong results.
azure_costs_eur:increase2d =
  (azure_costs_eur - azure_costs_eur offset 2d)

# Number of updates from the Azure API over the last 2 days. The Azure API is providing changes once
# a day but not at the same time. So we expect this value to be either 1 or 2.
azure_costs_eur:changes2d =
  changes(azure_costs_eur[2d])

# This metric shows our total daily costs. Due to the slow moving counters provided by the Azure API,
# the value is computed as the average over the last 2 days. In Prometheus speak, we emit
# the average observation size over a 2 day time period. As we only have ~1 change per day
# this is our daily costs.
job:azure_costs_eur:mean2d =
  sum(
      (azure_costs_eur:increase2d > 0)
    /
      (azure_costs_eur:changes2d  > 0) # We need the > 0 filter to prevent the propagation of NaN.
  )
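A worked example of what the rules above compute for a single series (the numbers are invented for illustration). The increase over the window divided by the number of API updates in that window gives the average cost per update, which at roughly one update per day is the daily cost:

```python
# Invented sample values for one azure_costs_eur series:
cost_now      = 1234.0  # azure_costs_eur
cost_2d_ago   = 1150.0  # azure_costs_eur offset 2d
changes_in_2d = 2       # changes(azure_costs_eur[2d]); expected to be 1 or 2

increase2d = cost_now - cost_2d_ago      # azure_costs_eur:increase2d
daily_cost = increase2d / changes_in_2d  # average cost per update ~= daily cost
print(daily_cost)

# The "> 0" filters in the rule drop series where either value is 0,
# which avoids dividing by zero / propagating NaN into the sum.
```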

I think I ruled out that the non-associativity of floats is the problem. I could track down the flaps to occurring only during updates of the API. Also, I wasn't able to reproduce the flapping with Python floats. So I suspect the issue is on the server side and not in the code that aggregates. However, rounding or truncating in some way might still fix it.