czerwonk/bird_exporter

state label in bird_protocol_up leads to a unique time series for each state

Opened this issue · 1 comments

Overview

In Prometheus, every time series is uniquely identified by its metric name and its set of labels. So when the value of the state label changes in the bird_protocol_up metric, a new time series is created alongside the one with the previous state. This ruins the metric: instead of one bird_protocol_up time series per BIRD protocol, we see several in parallel. And when the state changes regularly (e.g. the session flaps), each of those series has gaps.

How to replicate

If the BGP peer on the other side becomes unavailable, BIRD tries to reconnect and goes through several states. In the example below, Prometheus shows three different bird_protocol_up time series for one peer. They correspond to the BGP states (values of the state label):

  • "Idle Socket: No route to host"
  • "Connect Socket: No route to host"
  • "Active Socket: No route to host"

All three exist in the TSDB in parallel.
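
To make the effect concrete, here is a minimal sketch (not bird_exporter code) that models series identity as the metric name plus the sorted label set. The helper names seriesKey and countSeries, the peer name "peer1", and the reduced label set (name, proto, state) are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a canonical identity for a sample: the metric name plus
// its labels sorted by label name. Two samples belong to the same series
// only if this key matches exactly.
func seriesKey(metric string, labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for n := range labels {
		names = append(names, n)
	}
	sort.Strings(names)
	pairs := make([]string, 0, len(names))
	for _, n := range names {
		pairs = append(pairs, fmt.Sprintf("%s=%q", n, labels[n]))
	}
	return metric + "{" + strings.Join(pairs, ",") + "}"
}

// countSeries returns how many distinct series a sequence of scrapes creates
// for one hypothetical peer, with or without the state label attached.
func countSeries(states []string, withState bool) int {
	seen := map[string]struct{}{}
	for _, s := range states {
		labels := map[string]string{"name": "peer1", "proto": "BGP"}
		if withState {
			labels["state"] = s
		}
		seen[seriesKey("bird_protocol_up", labels)] = struct{}{}
	}
	return len(seen)
}

func main() {
	// One observed state per scrape, taken from the list above.
	states := []string{
		"Idle Socket: No route to host",
		"Connect Socket: No route to host",
		"Active Socket: No route to host",
	}
	fmt.Println("with state label:", countSeries(states, true))     // 3
	fmt.Println("without state label:", countSeries(states, false)) // 1
}
```

With the state label the three scrapes produce three series for the same peer; without it they all land in one.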

Problems this approach creates

  • When the protocol state changes (e.g. flaps), bird_exporter reports only the current state. At the moment of scraping it can be in one state and a second later in another, but Prometheus never sees the second one. Different combinations of scrape intervals and protocol timers produce different (and confusing) results in monitoring.
  • In a complex environment with thousands of peers (and thus many label combinations), an unstable, unpredictable number of time series per protocol is difficult to manage. Idempotence is difficult to achieve. Automation breaks.
  • It is difficult to tell which BGP state is the current one. Prometheus returns all time series (three in the example above) for a single BIRD protocol. The same happens with instant queries, because series with different states are considered unique.
  • It is difficult to count the peers for which bird_protocol_up == 0. Instead of the actual number of down peers, count shows the number of unique time series, which is not what we want to see. I still managed to do it using count(group by (state) {}), but IMHO this is more a workaround than a proper solution.
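
The counting problem from the last bullet can be sketched as follows. This is a simplified model, not a PromQL evaluation: the sample type, the peer names, and the assumption that all three state series for peer1 are returned by the query (as described above) are illustrative only.

```go
package main

import "fmt"

// sample is one series returned by a query, reduced to the fields that
// matter here. The field names are hypothetical.
type sample struct {
	peer  string
	state string
	value float64
}

// countDownSeries counts every series whose value is 0. This is what a
// plain count over bird_protocol_up == 0 effectively measures.
func countDownSeries(samples []sample) int {
	n := 0
	for _, s := range samples {
		if s.value == 0 {
			n++
		}
	}
	return n
}

// countDownPeers deduplicates by peer name before counting, so a peer that
// appears under several state labels is counted once.
func countDownPeers(samples []sample) int {
	down := map[string]struct{}{}
	for _, s := range samples {
		if s.value == 0 {
			down[s.peer] = struct{}{}
		}
	}
	return len(down)
}

func main() {
	samples := []sample{
		{"peer1", "Idle Socket: No route to host", 0},
		{"peer1", "Connect Socket: No route to host", 0},
		{"peer1", "Active Socket: No route to host", 0},
		{"peer2", "Established", 1},
		{"peer3", "Idle", 0},
	}
	fmt.Println("down series:", countDownSeries(samples)) // 4
	fmt.Println("down peers:", countDownPeers(samples))   // 2
}
```

Two peers are down, but the naive count reports four because peer1 contributes one series per state.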

Suggestion

  • Remove the state label from the bird_protocol_up metric.
  • Return to a separate bgp_state metric as in #46, but make it optional (activated with a flag at startup). People who need it will have it. Others, who believe that Prometheus is only for numerical metrics, won't.
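
A rough sketch of what the two suggestions could look like together, using only the Go standard library. Everything specific here is an assumption: the flag name -bgp.state-metric is hypothetical, the label set on bird_protocol_up is reduced to name and proto, and the labels and value of bgp_state are guessed rather than taken from #46. It does not use the Prometheus client library that a real exporter would.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strings"
)

// protocol is a simplified view of one BIRD protocol (hypothetical type).
type protocol struct {
	name  string
	proto string
	state string
	up    bool
}

// render writes samples in a Prometheus-style text format. bird_protocol_up
// carries no state label; the state goes into a separate metric that is
// emitted only when withState is true. %q matches Prometheus label quoting
// for the plain ASCII values used here.
func render(protos []protocol, withState bool) string {
	var b strings.Builder
	for _, p := range protos {
		v := 0
		if p.up {
			v = 1
		}
		fmt.Fprintf(&b, "bird_protocol_up{name=%q,proto=%q} %d\n", p.name, p.proto, v)
		if withState {
			// Assumed shape of the optional metric.
			fmt.Fprintf(&b, "bgp_state{name=%q,state=%q} 1\n", p.name, p.state)
		}
	}
	return b.String()
}

// parseArgs reads the hypothetical opt-in flag from the command line.
func parseArgs(args []string) (bool, error) {
	fs := flag.NewFlagSet("exporter-sketch", flag.ContinueOnError)
	withState := fs.Bool("bgp.state-metric", false, "expose the optional BGP state metric (hypothetical flag)")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *withState, nil
}

func main() {
	withState, err := parseArgs(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	protos := []protocol{{name: "peer1", proto: "BGP", state: "Idle", up: false}}
	fmt.Print(render(protos, withState))
}
```

Run without arguments, the sketch prints a single stable bird_protocol_up line per protocol; with -bgp.state-metric it adds the state metric for those who want it.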

Who will do it

I can implement it myself if we agree on the approach.

I am facing the same problem. A variable state label in the bird_protocol_up metric doesn't look right.