omgnetwork/elixir-omg

Maintain datadog monitors via code

boolafish opened this issue · 6 comments

Why

We have more and more alarms/metrics/monitors in Datadog now, and maintaining them is currently a messy manual process. There are also two possible ways to trigger an alarm: directly from application code, or via a Datadog monitor.

We have some alarms that are triggered directly in application code. However, it would be great if all alarms could live in Datadog and the application only needed to emit metrics. Currently, in most cases we emit both the metric and the alarm from application code.

But even without that simplification, using code as infrastructure to maintain our current messy monitors would probably already be beneficial enough.

Note

One potential tool is Terraform: https://www.terraform.io/docs/providers/datadog/r/monitor.html
It seems to support Datadog too.
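For illustration, here's a minimal sketch of what a monitor-as-code definition could look like with Terraform's datadog_monitor resource. The metric name, thresholds, and notification target are hypothetical, just to show the shape of it:

```hcl
# Sketch only: assumes the Datadog provider is already configured
# with api_key/app_key. The metric name, thresholds and notification
# channel below are placeholders, not metrics we emit today.
resource "datadog_monitor" "pending_block_submissions" {
  name    = "[childchain] pending block submission backlog too high"
  type    = "metric alert"
  message = "Block submission backlog is above threshold. Notify @slack-example-channel"

  query = "avg(last_10m):avg:omg.pending_block_submissions{env:production} > 10"

  # Older provider versions call this block `thresholds` instead.
  monitor_thresholds {
    warning  = 5
    critical = 10
  }

  notify_no_data = false
  tags           = ["managed-by:terraform"]
}
```

With something like this in place, adding or changing a monitor becomes a reviewed pull request instead of manual clicking around the Datadog UI.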

Just a note: you're using the word alarm, which has two definitions:

Application alarm that allows the application to be reactive.

Metric alarm.

My intention is the metric alarm, the one that would notify an engineer to react. I think Datadog uses the word monitor; I'm just used to using the word alarm from before 😅

Just to expand on Ino's comment a bit. The in-app alarms right now are mostly the ones that signal the service is not in good shape to accept new transactions, e.g. :boot_in_progress, :ethereum_connection_error, :ethereum_stalled_sync, :invalid_fee_source, :main_supervisor_halted. When these are raised, new transactions are rejected immediately, so I think they'll have to remain in the application? The rest, the ones that need a human's help, are already in Datadog.

The only one that seems out of place is :system_memory_too_high, which is being removed in #1678 (comment).

If we want to simplify a bit and standardize everything, we could send those :boot_in_progress-style events to Datadog as metrics as well. In the app, the original logic to react to them stays in place, but instead of raising the alarm directly, it sends a metric. That way we'd have a maintained list of monitors/alarms in a single place and could see everything we have.

But... it might not bring much benefit, since that code is already in place and working fine. It's mostly just the burden of not being able to see all the monitors we have in one place.

Ah I see. I think we can send an increment metric for each of the events above; then we can start to see their frequency patterns. The alarms would also be treated more properly by setting the alarm threshold to > 0 for each metric. Right now they're sort of reported as discontinuous events.
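To make that concrete, a sketch of what such a monitor could look like, assuming a hypothetical omg.events.boot_in_progress count metric emitted by the app each time the event is raised (the metric name and message are made up):

```hcl
# Sketch only: alerts whenever the hypothetical count metric is
# greater than 0 in the evaluation window, i.e. the event fired at all.
resource "datadog_monitor" "boot_in_progress_events" {
  name    = "[childchain] :boot_in_progress events observed"
  type    = "metric alert"
  message = "The app raised :boot_in_progress. Notify @slack-example-channel"

  query = "sum(last_5m):sum:omg.events.boot_in_progress{env:production}.as_count() > 0"

  # The critical threshold matches the `> 0` comparison in the query.
  monitor_thresholds {
    critical = 0
  }

  notify_no_data = false
}
```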

But I concur that it's not high priority given other things we have in hand at the moment.

lol, my personal feeling is that nobody will take this up at all hahaha, unless the gold team board gets cleaner 😅

But can we pick this up in new services like chch v2, as a standard approach for new services?! @InoMurko @achiurizo