netdata/helmchart

Clarification of how to override alarm rules


Dear Community,

I have been struggling with how to override the alarm rules on our Netdata deployment running on three of our Kubernetes clusters. We are also using Netdata Cloud to integrate all the dashboards into one. Pretty neat.

However, I could not find a clear description of how to override the rules that drive the alarms.

Story: On our clusters we host all kinds of workloads, including data processing (Spark), web services, Kubeless functions, load balancers, a bunch of operators, Seldon models, and the list goes on. We integrated alarms with Discord, but we are getting a lot of false positives, so currently Netdata is not of much use to us. On a single-node Dockerized environment we learned to update alarm rules as described in the documentation, by editing the rules under the health.d directory.

Problem: health.d is empty in both the child and parent Pod containers. I looked around the Pods, including the mounted volumes, and found no trace of where these rules live.

Setup: this is how we deploy Netdata:

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: netdata
  namespace: kube-system
  annotations:
    fluxcd.io/automated: "false"
spec:
  chart:
    repository: https://netdata.github.io/helmchart/
    name: netdata
    version: 3.6.3
  releaseName: netdata
  values:
    parent:
      database:
        storageclass: local
      alarms:
        storageclass: local
      configs:
        stream:
          data: |
            [11111111-2222-3333-4444-555555555555]
              enabled = yes
              history = 36000
              default memory mode = save
              health enabled by default = auto
              allow from = *
        health:
          data: |
            # Configuration for alarm notifications
            #
            # This configuration is used by: alarm-notify.sh
            # changes take effect immediately (the next alarm will use them).
            #
            ...
            ...

I looked around in the chart documentation to figure out where to put the alarm rules, but could not find anything.

I'd be happy to be pointed to resources and told what to do. :-)

Thanks,

Hi, @zzvara 👋 Sorry for the late reply.

but we are getting a lot of false positives

Could you provide some examples? We can tune our default alarms if there are any problems.

On a single-node Dockerized environment we learned to update alarm rules as described in the documentation, by editing the rules under the health.d directory.

The same works for k8s. For instance, if I want to overwrite health.d/cgroups.conf on the parent node:

  • create overwrite.yaml with my cgroups.conf
overwrite.yaml
parent:
  configs:
    health_cgroups:
      enabled: true
      path: /etc/netdata/health.d/cgroups.conf
      data: |
        template: cgroup_10min_cpu_usage
               on: cgroup.cpu_limit
            class: Cgroups
        component: CPU
             type: Utilization
               os: linux
            hosts: *
           lookup: average -10m unaligned
            units: %
            every: 1m
             warn: $this > (($status >= $WARNING)  ? (75) : (85))
             crit: $this > (($status == $CRITICAL) ? (85) : (95))
            delay: down 15m multiplier 1.5 max 1h
             info: average cgroup CPU utilization over the last 10 minutes
               to: sysadmin

         template: cgroup_ram_in_use
               on: cgroup.mem_usage
            class: Cgroups
        component: Memory
             type: Utilization
               os: linux
            hosts: *
             calc: ($ram) * 100 / $memory_limit
            units: %
            every: 10s
             warn: $this > (($status >= $WARNING)  ? (80) : (90))
             crit: $this > (($status == $CRITICAL) ? (90) : (98))
            delay: down 15m multiplier 1.5 max 1h
             info: cgroup memory utilization
               to: sysadmin
  • apply it when installing/updating the netdata helmchart, as sketched below
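
For reference, applying the values with plain Helm could look like the sketch below (the release name, namespace and repo alias are assumptions; in a Flux setup like yours the same keys would go under spec.values of the HelmRelease instead):

# assumes the chart repo is added under the alias "netdata"
helm repo add netdata https://netdata.github.io/helmchart/
helm upgrade --install netdata netdata/netdata --namespace kube-system -f overwrite.yaml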

0 netdata (master %)$ kubectl exec netdata-parent-6879f8474b-xmmz8 -- cat /etc/netdata/health.d/cgroups.conf
template: cgroup_10min_cpu_usage
       on: cgroup.cpu_limit
    class: Cgroups
component: CPU
     type: Utilization
       os: linux
...
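
If the pod is not rolled automatically after the config change, the health engine can usually be told to re-read its configuration without restarting the agent (this assumes netdatacli is available in the image, which it is in recent Netdata versions):

kubectl exec netdata-parent-6879f8474b-xmmz8 -- netdatacli reload-health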

Well, we have problems with runaway AI models that eat up a lot of memory:

  • sigma04 is critical, cgroup_k8s_cntr_development_fnp-backend-inference-fnp-backend-inference-0-classifier-5wmq4h_classifier.mem_usage (mem), cgroup ram in use = 98.5%

Also frequent packet drops:

  • sigma03 needs attention, net_drops.br0 (br0), inbound packets dropped = 30 packets

Frequent disk utilization alarms due to peaks:

  • sigma01 needs attention, disk_util.sda (sda), 10min disk utilization = 90.8%

The disk utilization alarm is triggered when we build JARs using a Bamboo Agent on any one of our machines.

There are other utilization fluctuations that get reported by Netdata as well, but these may just be normal operations.

Thanks for the description on how to update the configuration, it was super useful!

We adjusted a lot of alarms in the latest stable release.

inbound packets dropped = 30 packets

We removed the warn/crit triggers from the inbound_packets_dropped alarm, so it no longer sends notifications. Its value is still used by inbound_packets_dropped_ratio.

10min disk utilization

We changed the 10min_disk_utilization alarm to silent due to a lot of false positives, so it no longer sends notifications either.


I suggest you update to the latest version 😄 And this feedback is very useful; we want to improve our stock alarms.
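
If updating is not possible right away, the same overwrite mechanism as above can also be used to silence an alarm locally. A minimal sketch for the disk one, following the cgroups example (the configs key name is my own choice here, warn/crit thresholds are deliberately omitted so the template never raises, and note that this file replaces the whole stock health.d/disks.conf, so any other disk alarms you want to keep would need to be copied in as well):

parent:
  configs:
    health_disks:   # key name is arbitrary, following the cgroups example above
      enabled: true
      path: /etc/netdata/health.d/disks.conf
      data: |
        # replaces the stock disks.conf; no warn/crit thresholds and a silent recipient,
        # so the alarm is still computed but never sends a notification
         template: 10min_disk_utilization
               on: disk.util
            hosts: *
           lookup: average -10m unaligned
            units: %
            every: 1m
               to: silent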

Thanks for the description on how to update the configuration, it was super useful!

Perfect. If you have any more questions, do not hesitate to ask!