moira-alert/moira

Warn user that TTL is too small

beevee opened this issue · 1 comments

FEATURE

Summary

In the trigger creation/editing form, warn the user that a low TTL can lead to false positives. The low-TTL threshold could be configurable, but 300 seconds seems to be a reasonable default.

Details

Moira can generate false NODATA events when there is a temporary failure in incoming metric delivery (e.g. some of the carbon-relay replicas fail). We have some protection measures (e.g. heartbeats for complete loss of incoming data), but partial failures continue to be a problem.

Detecting partial failures is possible: one can automatically turn off notifications if incoming metric flow falls more than N% in volume. However, it is always a compromise: there are natural fluctuations in metric volume that can falsely trigger this protection and therefore increase operational load for on-duty engineers.

Our production experience shows that turning off notifications is adequate if the incoming metric flow falls more than 10% in volume compared to the last hour's average, and this reduced flow doesn't recover for at least 4 minutes. Obviously, this approach does nothing to protect users who set TTLs of less than 5 minutes in their triggers. Currently we do not warn these users that they will experience more false positives if they configure their triggers with low TTLs.

The form should also reject a TTL of 0 or any negative value.