netdata/netdata-cloud

[Feat]: When defining an Alert Silencing Rule I should be able to filter down till the alert instance (chart name)

Closed this issue · 7 comments

Problem

Take a disk.space alert that fires on multiple Mount Points: if I want to silence the alert for one specific Mount Point, I can't do it with the currently available attributes:

  • Alert name
  • Alert context (chart context)

Description

To have finer-grained control over silencing specific alert instances, be they Mount Points, Network Devices, or even Database Instances, Netdata should provide this level of flexibility.

Importance

really want

Value proposition

  1. Provide more flexibility on the Alert Notification Silencing Rules

Proposed implementation

There are two considered options:

  1. Use the attributes already available on an alert: chart name (for display) / chart id (stored in the DB)

    • when the user starts from an Active Alert, this should be pre-filled immediately
    • when the user starts from a blank rule, the new attribute should only become available once the user fills in either "alert name" or "alert context", so we can provide a pre-filtered list of "alert instances" (name TBC)
  2. Rely on chart labels, like the current alert definitions do (check learn here), to let the user specify which alerts on given chart(s) are silenced

@car12o I wrote up two proposed solutions based on what we discussed on the daily. I know that for 2. you had to check something before we know whether it is a way forward.
Could you update this ticket with your findings when you are able to?

I can confirm that we don't have chart labels on the alert transition, but we do have them on the alert config, although I don't know if that's what you expect.
here are some alert config examples:

template                           |chart                                                         |component    |units                |info                                                                                                                          |summary                                        |host_labels      |chart_labels                                           |
-----------------------------------+--------------------------------------------------------------+-------------+---------------------+------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+-----------------+-------------------------------------------------------+
mdstat_mismatch_cnt                |md.mismatch_cnt                                               |RAID         |unsynchronized blocks|number of unsynchronized blocks for the ${label:device} ${label:raid_level} array                                             |                                               |                 |raid_level=!raid1 !raid10 *                            |
disk_space_usage                   |disk.space                                                    |Disk         |%                    |Total space utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} space usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
postgresql_pg_wall_disk_space_usage|disk.space                                                    |PostgreSQL   |%                    |The percentage of Disk Space being used by the pg_wall.                                                                       |Disk ${label:mount_point} (pg_wall) space usage|_os=linux freebsd|mount_point=/media/pgdata_adto                         |
DLE_CAS_sync_instance_lag          |DLE_CAS.sync_instance_lag                                     |sync_instance|seconds              |DLE_CAS Sync instance - high lag of WAL replay. Time stamp of last transaction replayed during recovery exceeds the threshold.|                                               |                 |_collect_module=DLE_CAS _collect_plugin=charts.d.plugin|
disk_space_usage                   |disk.space                                                    |Disk         |%                    |Total space utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} space usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
disk_inode_usage                   |disk.inodes                                                   |Disk         |%                    |disk ${label:mount_point} inode utilization                                                                                   |                                               |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
disk_space_usage                   |disk.space                                                    |Disk         |%                    |Total space utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} space usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
rds_freeable_memory_alert          |prometheus.cloudwatch_exporter.aws_rds_freeable_memory_average|             |MB                   |AWS RDS instance freeable memory                                                                                              |                                               |                 |dbinstance_identifier= !koi-nonprod-infra-mysql *      |
disk_space_usage                   |disk.space                                                    |Disk         |%                    |Total space utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} space usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
disk_inode_usage                   |disk.inodes                                                   |Disk         |%                    |Total inode utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} inode usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |

bear in mind that we do have some configs without any chart labels.
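
For reference, a rough sketch of how the chart_labels values above could be evaluated against a label value. It assumes Netdata's simple-pattern semantics (space-separated tokens, `*` as wildcard, `!` for a negative match, first matching token wins); the function name is hypothetical and this is not the agent's actual implementation.

```python
import re


def simple_pattern_match(pattern: str, value: str) -> bool:
    """Return True if `value` is accepted by a space-separated pattern list.

    Assumed semantics: tokens are tried left to right, the first token that
    matches decides the result, and a leading '!' turns a match into a reject.
    """
    for token in pattern.split():
        negative = token.startswith("!")
        body = token[1:] if negative else token
        # Only '*' is treated as a wildcard; every other character is literal.
        regex = ".*".join(re.escape(part) for part in body.split("*"))
        if re.fullmatch(regex, value):
            return not negative
    # No token matched (this is also the result for an empty pattern).
    return False


# The mount_point pattern from the disk_space_usage rows above.
MOUNT_POINT_PATTERN = "!/dev !/dev/* !/run !/run/* *"
```

With this pattern, `/dev/shm` would be rejected by `!/dev/*` before the trailing `*` is reached, while `/` falls through to `*` and is accepted.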

nevertheless, I think it's easier and more straightforward to filter by alert instance (chart_name/chart_id)

> bear in mind that we do have some configs without any chart labels.

I think this is probably because of older agent versions, where we weren't using labels

> nevertheless, I think it's easier and more straightforward to filter by alert instance (chart_name/chart_id)

I agree it is easier for now. The discussion was about whether it would make sense to move towards a more ideal solution that relies on labels, since that is also how we scope alerts in the alert definitions.
I'm ok to progress with the alert instance and we can revisit this later

@kapantzak from your side all good?

> I think this is probably because of older version agents where we weren't using labels

I don't think that's the case: when I sort by created timestamp I still get some configs with empty chart labels.

@hugovalente-pm using the alert instance seems easier and more straightforward to me too.

However I'm not sure if I have this information at that point. I see that I get contexts, names and roles from this endpoint: api/v2/spaces/{spaceID}/alarms/metas, but how do I get the instance?
@car12o

@kapantzak here's how to get the data, let me know if something is not clear

Get instances from alert name or context

POST /api/v2/spaces/{spaceID}/rooms/{roomID}/alerts
body:

{
  "options": ["instances"],
  "scope": {
    "nodes": ["{nodeID}"], // optional: filter by node
    "contexts": ["{context}"] // to get instances by context (e.g. disk.space)
  },
  "selectors": {
    "alert": ["{alert_name}"] // to get instances by alert name (e.g. disk_space_usage)
  }
}

all these parameters are optional but, as we discussed, to narrow down the list of possible instances we should always specify either contexts or alert.
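
A minimal sketch of building that body on the client side, enforcing the "always specify either contexts or alert" rule. The function name and its parameters are hypothetical; only the body fields come from the request above. Leaving out empty `scope` and `selectors` keys is an assumption based on all parameters being optional.

```python
from typing import Optional


def build_instances_request(
    context: Optional[str] = None,
    alert_name: Optional[str] = None,
    node_id: Optional[str] = None,
) -> dict:
    """Build the POST body for listing alert instances.

    Raises ValueError when neither a context nor an alert name is given,
    since the list of instances would then not be narrowed down at all.
    """
    if not context and not alert_name:
        raise ValueError("specify at least one of context or alert_name")

    body: dict = {"options": ["instances"]}

    scope: dict = {}
    if node_id:
        scope["nodes"] = [node_id]
    if context:
        scope["contexts"] = [context]
    if scope:
        # Assumption: an empty scope can simply be omitted.
        body["scope"] = scope

    if alert_name:
        body["selectors"] = {"alert": [alert_name]}

    return body
```

For example, `build_instances_request(context="disk.space")` yields a body with only `options` and `scope.contexts` set, which would then be sent to the endpoint above.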

what identifies an alert instance is the chart. the response looks like this:

{
  "api": 2,
  "nodes": [
    // ...
  ],
  "alert_instances": [
    {
      "ni": 2,
      "ati": null,
      "sum": "Disk / space usage",
      "info": "Total space utilization of disk /",
      "nm": "disk_space_usage",
      "ch": "disk_space._", // chart_id - this should be the field used when posting a rule
      "ch_n": "disk_space._", // chart_name - this should be the field used to display on the UI (friendly name)
      "ctx": "disk.space",
      "st": "CLEAR",
      "v": 0,
      "t": 0,
      "tr_i": "b047bf45-f831-49c8-b8de-6d76f5712858",
      "tr_v": 9.579043377550992,
      "tr_t": 1710857916,
      "units": "%",
      "cfg": "13038942-685d-4c69-9431-5d8877db1f80",
      "src": "line=10,file=/usr/lib/netdata/conf.d/health.d/disks.conf",
      "exec": "/usr/libexec/netdata/plugins.d/alarm-notify.sh",
      "tp": "System",
      "cl": "Utilization",
      "cm": "Disk",
      "to": "sysadmin",
      "slc": {
        "state": "NONE"
      }
    }
  ]
}
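
A small sketch of turning that response into options for the new rule field, using `ch` as the value posted with the rule and `ch_n` as the label shown in the UI, as noted in the comments above. The function name and the option shape (`value`/`label`) are hypothetical; de-duplicating by chart id assumes the same chart can come back more than once (for example from several nodes), which I haven't verified.

```python
def instance_options(response: dict) -> list:
    """Map alert instances to {value, label} options, one per chart id."""
    seen = set()
    options = []
    for instance in response.get("alert_instances", []):
        chart_id = instance["ch"]
        if chart_id in seen:
            # Assumption: repeated chart ids should appear only once in the list.
            continue
        seen.add(chart_id)
        # Fall back to the chart id when no friendly name is present.
        label = instance.get("ch_n") or chart_id
        options.append({"value": chart_id, "label": label})
    return options


# Trimmed-down version of the response shown above.
sample_response = {
    "api": 2,
    "nodes": [],
    "alert_instances": [
        {"nm": "disk_space_usage", "ch": "disk_space._", "ch_n": "disk_space._", "ctx": "disk.space"},
    ],
}
```

Calling `instance_options(sample_response)` gives a single option whose value and label are both `disk_space._`.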

this is released