sapcc/helm-charts

[prometheus-vmware-rules] Create an alert for "All HBA paths are down"

Closed this issue · 0 comments

In VMware Aria Operations, the critical symptom "All HBA paths are down" is raised on the host system when one or more host adapters report that all of their paths are down, which may impact storage connectivity. The alert clears when at least one path is active.
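As a rough illustration of the trigger and clear condition described above, the following sketch evaluates it over a per-adapter list of path states. The function names, the adapter names and the boolean path model (`True` means the path is active) are assumptions made for this example; they are not taken from Aria Operations or the vrops-exporter.

```python
from typing import Dict, List


def adapters_with_all_paths_down(paths_by_adapter: Dict[str, List[bool]]) -> List[str]:
    """Return the adapters whose paths are all down.

    Each adapter maps to a list of booleans, True meaning "path is active".
    Adapters without any known path are ignored (an assumption of this sketch).
    """
    return sorted(
        adapter
        for adapter, paths in paths_by_adapter.items()
        if paths and not any(paths)
    )


def symptom_active(paths_by_adapter: Dict[str, List[bool]]) -> bool:
    """The symptom fires if at least one adapter has lost all of its paths."""
    return bool(adapters_with_all_paths_down(paths_by_adapter))


if __name__ == "__main__":
    # Hypothetical host: vmhba1 has lost every path, vmhba2 still has one.
    host = {"vmhba1": [False, False], "vmhba2": [True, False]}
    print(adapters_with_all_paths_down(host), symptom_active(host))
```

Under this model a single active path on an adapter is enough to clear the condition for that adapter, which matches the clearing behaviour described above.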

In the vCenter events, the alert is structured as follows: Lost connectivity to storage device <...>, Path <...> is down
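If the event text needs to be matched programmatically, a minimal sketch could look like the following. It assumes the message follows exactly the structure quoted above; the pattern, the function name and the sample values in the usage part are hypothetical and would need checking against real vCenter events.

```python
import re
from typing import Dict, Optional

# Pattern derived from the message structure quoted in this issue (assumption:
# real events use the same wording and punctuation).
EVENT_PATTERN = re.compile(
    r"Lost connectivity to storage device (?P<device>.+?), Path (?P<path>.+?) is down"
)


def parse_path_down_event(message: str) -> Optional[Dict[str, str]]:
    """Extract the storage device and path from an event message, or return None."""
    match = EVENT_PATTERN.search(message)
    if match is None:
        return None
    return {
        "device": match.group("device").strip(),
        "path": match.group("path").strip(),
    }


if __name__ == "__main__":
    # Made-up device and path identifiers, for illustration only.
    sample = "Lost connectivity to storage device naa.example, Path vmhba1:C0:T0:L0 is down"
    print(parse_path_down_event(sample))
```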

Use the AlertCollector from the vrops-exporter, which queries the resource kind via the inventory's internal API, to build the Prometheus metric.

  • Install/upgrade to the latest version of the vRealize Operations Management Pack for Storage Devices on all vROps instances
  • Define a VMware alert based on the scraped metric
  • Fetch the following attribute/field values:
    • VC with region and AZ
    • Cluster name
    • Affected datastore(s)
    • FQDN of event
    • Host/node name
  • Test the alerting in the Concourse CI pipeline
    • Roll out the changes to the region(s)
  • Simulate an APD (All Paths Down) alert by disabling the network cards (iSCSI software adapter/initiator, physical NICs 1 and 4) in the UCS cluster manager
  • Test and validate
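
To make the attribute list above more concrete, here is a sketch of how those values could be carried as labels on a single sample in the Prometheus text exposition format. The metric name, label names and all values are hypothetical; the actual names would come from the vrops-exporter's AlertCollector and may differ.

```python
from typing import Mapping


def escape_label_value(value: str) -> str:
    """Escape backslash, newline and double quote as the text format requires."""
    return value.replace("\\", "\\\\").replace("\n", "\\n").replace('"', '\\"')


def render_sample(name: str, labels: Mapping[str, str], value: float = 1) -> str:
    """Render one sample line: name{label="value",...} value."""
    rendered = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{rendered}}} {value}"


if __name__ == "__main__":
    # All names and values below are placeholders, not real inventory data.
    labels = {
        "alert_name": "All HBA paths are down",
        "vcenter": "vc-example",
        "region": "region-example",
        "az": "az-example",
        "vccluster": "cluster-example",
        "datastore": "ds-example",
        "hostsystem": "node-example",
    }
    print(render_sample("vrops_hostsystem_alert_info", labels))
```

An alert rule in prometheus-vmware-rules could then select on the alert name label and reuse the remaining labels in its annotations.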