archway-network/networks

spike: Define critical alerts for archway nodes

Closed · 1 comment

Issue is synchronized with this Jira Task by Unito

➤ Joonas Lehtimäki commented:

shahbazn, we need to figure out the following:

  • Grouping: I think we should group the logs by network, e.g. constantine-1, titus, etc.
  • Which strings should we watch for in the logs? Matching on ERR is not enough, because many ERR lines are not actual errors that need our attention (e.g. voting failed).
  • How many errors per minute should trigger an alert?
  • Do we want to monitor the processes themselves? Through systemd logs, or some other way?
  • Which services might we want to monitor, and what do we want from them? For example, for the Caddy server, do we want to count 500/400 response codes?
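As a rough illustration of the first three points (grouping by network, filtering ERR lines that are not actionable, and counting errors per minute), here is a minimal Python sketch. It assumes each log line has already been tagged with its network name and a parsed timestamp. The benign-pattern list, the threshold of 5 errors per minute, and all function names are hypothetical; only "voting failed" comes from the discussion above.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple

# Hypothetical list of ERR messages treated as benign (matched case-insensitively).
# Only "voting failed" comes from the discussion; extend as needed.
BENIGN_ERR_PATTERNS = ("voting failed",)

# Hypothetical threshold: alert when a network logs MORE than this many
# actionable errors within a single minute.
ERRORS_PER_MINUTE_THRESHOLD = 5

# Assumed record shape: (network, timestamp, raw log line).
LogRecord = Tuple[str, datetime, str]


def is_actionable_error(line: str) -> bool:
    """True if the line contains 'ERR' (case-sensitive) and no benign pattern."""
    if "ERR" not in line:
        return False
    lowered = line.lower()
    return not any(pattern in lowered for pattern in BENIGN_ERR_PATTERNS)


def errors_per_minute(records: Iterable[LogRecord]) -> Counter:
    """Count actionable errors per (network, minute) bucket."""
    counts: Counter = Counter()
    for network, ts, line in records:
        if is_actionable_error(line):
            # Truncate the timestamp to the start of its minute.
            minute = ts.replace(second=0, microsecond=0)
            counts[(network, minute)] += 1
    return counts


def alerts(records: Iterable[LogRecord], threshold: int = ERRORS_PER_MINUTE_THRESHOLD):
    """Return sorted ((network, minute), count) pairs whose count exceeds threshold."""
    return sorted(
        (key, count)
        for key, count in errors_per_minute(records).items()
        if count > threshold
    )


if __name__ == "__main__":
    t = datetime(2023, 1, 1, 12, 0, 5)
    sample = (
        [("constantine-1", t, "ERR failed to connect")] * 6
        + [("titus", t, "ERR voting failed")] * 10
    )
    # constantine-1 exceeds the threshold; titus only logs benign errors.
    print(alerts(sample))
```

Grouping by network before counting keeps a noisy testnet from masking, or triggering, alerts for another network. The same bucketing approach could be applied to Caddy access logs by counting 4xx/5xx status codes instead of ERR lines.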