ndelitski/rancher-alarms

Feature Request - Notification also from degraded to active

jayhding opened this issue · 6 comments

Right now a notification is only sent when a service has been degraded for a while, but we would also like to receive a notification when it recovers from the degraded status. That way we know the service's final status.

What do you think about how we should notify when a service periodically jumps from degraded to active and back? Is it OK to receive that many notifications? The current logic is that when a service becomes degraded you receive only one notification, regardless of subsequent status changes. Maybe we should have a specific setting to enable this feature?
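To make the trade-off concrete, here is a minimal sketch of the two behaviours under discussion. It is not the rancher-alarms code: the function `notifications`, the flag `notify_on_recovery`, and the plain-string states are all hypothetical names chosen for illustration.

```python
from typing import Iterable, List


def notifications(states: Iterable[str], notify_on_recovery: bool = False) -> List[str]:
    """Return the messages that would be sent for a sequence of observed states.

    Hypothetical sketch: with the flag off, only the first transition into
    "degraded" produces a message, mirroring the one-notification behaviour
    described above. With the flag on, each recovery also produces a message
    and re-arms the degraded alert.
    """
    sent: List[str] = []
    alerted = False  # True once a "degraded" message has gone out
    for state in states:
        if state == "degraded" and not alerted:
            alerted = True
            sent.append("degraded")
        elif state == "active" and alerted and notify_on_recovery:
            alerted = False  # re-arm so the next degradation notifies again
            sent.append("recovered")
    return sent


# A flapping service: degraded, back to active, degraded again, active again.
flapping = ["active", "degraded", "active", "degraded", "active"]
print(notifications(flapping))                           # one message
print(notifications(flapping, notify_on_recovery=True))  # four messages
```

With the flag off the flapping service yields a single message; with it on, every flip is reported, which is where the concern about notification volume comes from.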

That's exactly what we often see: a service flapping between active and degraded. I actually changed the code to notify in both directions, and we have been using it for some time.

It is true that it generates more emails, and that's why we switched to Slack messages for notifications.

But given how convenient Slack is to access, we can easily tell whether a service is back to a normal state without connecting to the private network after hours.

It is definitely fine to control this feature with a specific flag.

Let's see if @SydOps shares my opinion.

I don't really care about the recovered status. I agree with @ndelitski, there is no need for too many notifications. On a Rancher server plus hosts, especially in an enterprise, we will run thousands of containers; if there are too many notifications, operators will simply ignore them.

Second, we don't use Rancher Alarms as our main alarm system. We have others, such as Sensu, Dynatrace, etc. Those systems report high-level application and service health rather than container health. If one container is unhealthy but the HA/ELB or website works fine, we don't spend time on the problem immediately. For me, rancher-alarms is only for operators or developers who want quick notifications for particular Rancher container services. Only notify when it is needed.

Ideally, since Slack lets you delete messages, a smart enough Slack bot could delete the previous degraded message once it sees that the broken container is back and active. But I don't know how difficult that would be to code.

Recovery notification is a good feature if we can add the code, but make sure we have an option to turn it on/off easily.

We all have different desires and use cases; however, the degraded->active notification has been very useful for detecting flapping. Control by settings, yes please.

Too many notifications to Slack isn't really an issue, particularly if you use a dedicated channel. The concept of DevOps/agile/CD here is to stop work and fix things to keep the pipeline going. The spice must flow!

I doubt you can delete messages posted by a webhook, as it's not a real bot/user; worth checking though. In my opinion, however, deletion is changing history, and a log of those messages can help in a post-mortem of certain events.

What we have found is that if you get a Rancher alarm, something is wrong, so an operator really should look at it straight away, especially in production. It's quite different from getting noise for something like host alerting on 'busy cpu', which is informative and can be ignored.

To start, if we implement an option like notifyWhenRecovered=true, disabled by default, is everybody OK with that? It will be configurable per target (email|slack...).
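As a rough illustration of what "configurable per target" could mean, the sketch below picks which targets receive a message for a given status change. Only the option name notifyWhenRecovered comes from this thread; the `Target` class, the `targets_to_notify` helper, and the snake_case field name are assumptions made for the example, not the project's actual config schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Target:
    """Hypothetical notification target (e.g. email or slack)."""
    name: str
    notify_when_recovered: bool = False  # disabled by default, as proposed


def targets_to_notify(targets: List[Target], previous: str, current: str) -> List[str]:
    """Return the names of targets that should be notified for this transition."""
    if current == "degraded" and previous != "degraded":
        # Every target hears about a degradation.
        return [t.name for t in targets]
    if previous == "degraded" and current == "active":
        # Only targets that opted in hear about the recovery.
        return [t.name for t in targets if t.notify_when_recovered]
    return []


targets = [Target("email"), Target("slack", notify_when_recovered=True)]
print(targets_to_notify(targets, "active", "degraded"))
print(targets_to_notify(targets, "degraded", "active"))
```

Under these assumptions, email keeps today's behaviour while Slack additionally gets the recovery message, so each team can choose its own noise level.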

Absolutely. For a first version that is great, using the same template I'd assume.