spike: Define critical alerts for archway nodes
Closed this issue · 1 comments
sync-by-unito commented
➤ Joonas Lehtimäki commented:
shahbazn We need to figure out the following:
- Grouping. I think we should group the logs by network, e.g. constantine-1, titus, etc.
- Which strings should we watch for in the logs? ERR alone is not good enough, because many ERR lines are not actual errors that require our attention, e.g. voting failed.
- How many errors per minute should trigger an alert?
- Do we want to watch processes? Through systemd logs, or some other way?
- Which services might we want to monitor, and what do we want from them? For example, for the Caddy server, do we want to count 500/400 status codes?
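
To make the first three questions concrete, here is a rough Python sketch of one way it could work: group lines by network, skip ERR lines that match a list of known-benign patterns, and count what is left per minute. Everything in it is an assumption for discussion, not something we have decided: the names `BENIGN_PATTERNS`, `MAX_ERRORS_PER_MINUTE`, `is_actionable`, `count_errors` and `breaches` are hypothetical, the threshold of 5 is a placeholder, and the input is assumed to already be parsed into `(network, timestamp, line)` tuples.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical ignore-list: ERR lines matching these need no attention.
# "voting failed" is the only example named in this issue.
BENIGN_PATTERNS = [re.compile(r"voting failed", re.IGNORECASE)]

# Placeholder threshold; the real value is still an open question.
MAX_ERRORS_PER_MINUTE = 5


def is_actionable(line):
    """True if the line contains ERR and matches no benign pattern."""
    if "ERR" not in line:
        return False
    return not any(p.search(line) for p in BENIGN_PATTERNS)


def count_errors(records):
    """Count actionable errors per (network, minute).

    records: iterable of (network, timestamp, line) tuples, where
    timestamp is a datetime. How the records are collected (systemd
    journal, files, etc.) is out of scope for this sketch.
    """
    counts = Counter()
    for network, ts, line in records:
        if is_actionable(line):
            minute = ts.replace(second=0, microsecond=0)
            counts[(network, minute)] += 1
    return counts


def breaches(counts, limit=MAX_ERRORS_PER_MINUTE):
    """Return the (network, minute) buckets whose count exceeds limit."""
    return sorted(key for key, n in counts.items() if n > limit)


if __name__ == "__main__":
    # Made-up sample lines, only to show the shape of the input.
    sample = [
        ("constantine-1", datetime(2023, 1, 1, 12, 0, s), "ERR connection refused")
        for s in range(6)
    ]
    sample.append(("titus", datetime(2023, 1, 1, 12, 0, 1), "ERR voting failed"))
    for network, minute in breaches(count_errors(sample)):
        print(f"ALERT {network} at {minute:%H:%M}")
```

Keeping the ignore-list and the threshold as plain data means we can tune them per network later without touching the counting logic.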