Improve service monitoring and alerting

Question

Closed this issue 9 months ago · 2 comments

Issue summary
Service monitoring and alerting need improvement. The current problems are:

Monitoring for services is not standardised. The Forest node exposes some Prometheus and host metrics, whereas the sync check has no monitoring except for Slack messages.
Alerting has gaps. For example, if the daily snapshot service fails to deploy before the New Relic (NR) agent is installed, nothing raises an alert.

Investigate potential improvements. Those may include:

Having a standardised API for all services. For example, every service should expose a health check endpoint.
Should every service be containerized and ship its own NR agent, rather than relying on one installed on the host?
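
To make the health check idea concrete, here is a minimal sketch of what a shared endpoint could look like in Rust using only the standard library. Everything in it is an assumption for illustration: the `/healthz` path, the `Health` type, the function names, and the plain-text bodies are not taken from any existing service, and a real implementation would more likely use an HTTP framework.

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};

/// Hypothetical health states; the names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Healthy,
    Unhealthy,
}

/// Build a complete HTTP/1.1 response for a request line such as
/// "GET /healthz HTTP/1.1". The `/healthz` path is an assumed convention.
pub fn health_response(request_line: &str, health: Health) -> String {
    let mut parts = request_line.split_whitespace();
    let method = parts.next().unwrap_or("");
    let path = parts.next().unwrap_or("");
    let (status, body) = match (method, path, health) {
        ("GET", "/healthz", Health::Healthy) => ("200 OK", "ok"),
        ("GET", "/healthz", Health::Unhealthy) => ("503 Service Unavailable", "unhealthy"),
        _ => ("404 Not Found", "not found"),
    };
    format!(
        "HTTP/1.1 {}\r\nContent-Type: text/plain\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        status,
        body.len(),
        body
    )
}

/// Read one request from the connection and answer it.
/// Only the first line of the request is inspected.
pub fn serve_one(stream: &mut TcpStream, health: Health) -> std::io::Result<()> {
    let mut buf = [0u8; 1024];
    let n = stream.read(&mut buf)?;
    let request = String::from_utf8_lossy(&buf[..n]);
    let request_line = request.lines().next().unwrap_or("");
    stream.write_all(health_response(request_line, health).as_bytes())
}

/// Accept connections forever, calling `check` for each request so the
/// reported state reflects the service at the time of the probe.
pub fn serve(addr: &str, check: impl Fn() -> Health) -> std::io::Result<()> {
    let listener = TcpListener::bind(addr)?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        // A failed connection should not take the endpoint down.
        let _ = serve_one(&mut stream, check());
    }
    Ok(())
}
```

A service could run `serve("127.0.0.1:8080", || Health::Healthy)` on a background thread, replacing the closure with its own check (the address is also just an example). Returning 503 when unhealthy lets any generic HTTP probe distinguish the states without parsing the body.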

Look into the NR docs to see what's feasible. Remember that we may rewrite every service in Rust, so relying on NR Ruby gems is a no-go unless there is an equivalent for Rust (the C SDK was recently archived, but OpenTelemetry may offer an alternative that should be investigated).

This issue may spawn additional ones if it proves to be too big.

Answer 1 · 2023-10-17T10:05:38.000Z

@lemmih Feel free to add any additional requirements or things to remember. I'll come up with a solution after some investigation into current NR capabilities.

Answer 2 · 2023-10-17T10:24:48.000Z

Requirements:

Support for both prod and dev deployments. Dev deployments should be as close to prod as possible: they should log to Slack, have alerts, etc.
Monitoring must be entirely external to anything we deploy. In other words, a failed installation of NR should not prevent monitoring.
Custom alerts should be standardized and easy to write. This could take the form of a health check endpoint.
Re-deploying a service should not trigger alarms.
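
One way the last requirement could be met is for the external monitor to alert only after several consecutive failed probes, so the short outage during a re-deploy stays below the threshold. The sketch below illustrates that idea only. The `Probe`, `Action`, and `AlertState` names and the threshold of three are hypothetical, and a hosted monitoring product would normally provide this as a configurable alert condition instead of custom code.

```rust
/// Result of one external probe of a service's health check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Probe {
    Up,
    Down,
}

/// What the monitor should do after seeing a probe result.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Noop,
    Fire,
    Resolve,
}

/// Tracks consecutive failures for one service (hypothetical type).
pub struct AlertState {
    threshold: u32,
    consecutive_failures: u32,
    firing: bool,
}

impl AlertState {
    /// `threshold` is the number of consecutive failed probes required
    /// before an alert fires.
    pub fn new(threshold: u32) -> Self {
        AlertState {
            threshold,
            consecutive_failures: 0,
            firing: false,
        }
    }

    /// Record a probe result and report whether to fire or resolve.
    pub fn observe(&mut self, probe: Probe) -> Action {
        match probe {
            Probe::Up => {
                self.consecutive_failures = 0;
                if self.firing {
                    self.firing = false;
                    Action::Resolve
                } else {
                    Action::Noop
                }
            }
            Probe::Down => {
                self.consecutive_failures = self.consecutive_failures.saturating_add(1);
                if !self.firing && self.consecutive_failures >= self.threshold {
                    self.firing = true;
                    Action::Fire
                } else {
                    Action::Noop
                }
            }
        }
    }
}
```

With a threshold of three and probes every 30 seconds (both example values), a re-deploy that recovers within a minute produces at most two failed probes and no alert, while a service that stays down alerts on the third. The trade-off is that real outages are reported one probe interval times the threshold later.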