threefoldfoundation/tft

Prometheus exporter metrics for our bridges

Most of what we monitor happens at the point where a service is exposed. This works nicely in most cases (the dashboard, activation service, playground, and so on) but does not paint a complete picture of the state of a service.

A bridge has nothing exposed, so this approach does not fit. There is also the issue that a service (like a bridge) can be running but in a faulty state, for example unable to connect to a public chain node, or waiting on a signer that does not respond. In that case, the monitoring would still say the service is up but would have no clue what state it is in.

To tackle this we need 'state monitoring' so that we can determine whether a service is running smoothly or has some kind of issue going on. Since it's impractical to make a list of all possible errors, I was thinking it would be better to work with an 'error count' or 'warning count' or something similar.
On top of that, it would also be good to get some insight into what the bridge is doing: metrics like how many transfers the bridge handled, what the average bridge time is, etc.

So what we would expect of the exposed metrics is to watch the bridge logs for the last 24 hours and:

  • increase/decrease an error count (if above 0, Ops can then trigger an alert chain)
  • increase/decrease a warning count
  • keep the number of transfers in the last 24h
  • keep the average bridge time in the last 24h
  • report whether the bridge is in sync with the chains
  • ..
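
To make the list above a bit more concrete, here is a minimal sketch of how such 24h values could be tracked inside a service. It is only an illustration under assumptions: the class names (`RollingWindow`, `BridgeMetrics`) and the metric names are hypothetical and not taken from the bridge code, and it assumes the bridge can call `add()` whenever it logs an error, logs a warning, or completes a transfer.

```python
import time
from collections import deque


class RollingWindow:
    """Keeps (timestamp, value) samples no older than `window_seconds`."""

    def __init__(self, window_seconds=24 * 3600, clock=time.time):
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested
        self.samples = deque()

    def add(self, value=1.0):
        self.samples.append((self.clock(), value))

    def _prune(self):
        # Drop samples that have fallen out of the window.
        cutoff = self.clock() - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def count(self):
        self._prune()
        return len(self.samples)

    def average(self):
        self._prune()
        if not self.samples:
            return 0.0
        return sum(value for _, value in self.samples) / len(self.samples)


class BridgeMetrics:
    """Hypothetical container for the values listed in this issue."""

    def __init__(self, clock=time.time):
        self.errors = RollingWindow(clock=clock)
        self.warnings = RollingWindow(clock=clock)
        # One sample per transfer; the value is the bridge time in seconds.
        self.transfers = RollingWindow(clock=clock)
        self.in_sync = False

    def snapshot(self):
        return {
            "bridge_errors_24h": self.errors.count(),
            "bridge_warnings_24h": self.warnings.count(),
            "bridge_transfers_24h": self.transfers.count(),
            "bridge_transfer_seconds_avg_24h": self.transfers.average(),
            "bridge_in_sync": 1 if self.in_sync else 0,
        }
```

An alternative worth considering: with Prometheus it is common to expose plain counters that only ever increase and let the query side compute the 24h window (for example with `increase(...[24h])`), which would keep the bridge itself simpler.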

Most commonly, these metrics are exposed on a specific port of the service. Our bridges run in Kubernetes, so we can easily expose that port for our monitoring to scrape.
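
As a rough illustration of what exposing metrics on a port could look like, the sketch below serves a `/metrics` path in the Prometheus text exposition format using only the Python standard library. The port number `8000`, the function names, and the metric names are assumptions made for this example, and every value is exposed as a gauge for simplicity. A real implementation would more likely use an official Prometheus client library in the language of the bridge.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def render_metrics(values):
    """Render a {name: number} dict in the Prometheus text format."""
    lines = []
    for name, value in sorted(values.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"


def make_handler(get_values):
    """Build a request handler that serves get_values() on /metrics."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render_metrics(get_values()).encode("utf-8")
            self.send_response(200)
            self.send_header(
                "Content-Type", "text/plain; version=0.0.4; charset=utf-8"
            )
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep scrape requests out of the service log

    return MetricsHandler


def make_server(get_values, host="0.0.0.0", port=8000):
    # Port 8000 is an arbitrary choice for this sketch.
    return ThreadingHTTPServer((host, port), make_handler(get_values))


if __name__ == "__main__":
    # Hypothetical static values; a bridge would return live numbers here.
    server = make_server(lambda: {"bridge_errors_24h": 0, "bridge_in_sync": 1})
    server.serve_forever()
```

In Kubernetes, the container port used here would then be the one opened up for the monitoring stack to scrape.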

These are just suggestions on what we would need to properly monitor these bridges. Any input is welcome to improve this or make the scope smaller if it would be a lot of work.

We would need these metrics for the 3 bridges we currently have running:

  • Stellar / TFchain
  • BSC / TFchain
  • ETH / TFchain

Add Prometheus metrics so operations can finish the related ops issue: https://github.com/threefoldtech/tf_operations/issues/1920