The devices managed by Apstra AOS generate large amounts of data over time. On its own,
this data is voluminous and unhelpful. Through Intent-Based Analytics (IBA), AOS
allows the operator to combine intent from the AOS graph database with current
and historical data from devices to reason about the network at large.
For a detailed explanation of AOS IBA, please watch our recent webinar "Intent-Based Analytics: Prevent Network Outages and Gray Failures".
Probes are the basic unit of abstraction in IBA. Operators can create, configure,
and delete probes. Generally, a given probe consumes some set of data from the
network, performs successive aggregations and calculations on it, and
optionally specifies conditions on the results of those aggregations and
calculations under which anomalies are raised.
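The consume, aggregate, and check flow described above can be pictured with a small Python sketch. This is not how AOS implements probes; the names `Anomaly`, `run_probe`, and the sample interface labels are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Anomaly:
    """Hypothetical record of one raised anomaly (not an AOS type)."""
    source: str
    value: float


def run_probe(
    samples: Dict[str, List[float]],
    aggregate: Callable[[List[float]], float],
    is_anomalous: Callable[[float], bool],
) -> List[Anomaly]:
    """Aggregate the samples of each source, then apply the condition."""
    anomalies = []
    for source, values in samples.items():
        value = aggregate(values)      # aggregation / calculation stage
        if is_anomalous(value):        # optional anomaly condition
            anomalies.append(Anomaly(source, value))
    return anomalies


# Made-up utilization samples (percent) for two interfaces.
samples = {
    "leaf1:eth1": [10.0, 12.0, 11.0],
    "leaf1:eth2": [95.0, 97.0, 99.0],
}
anomalies = run_probe(samples, mean, lambda v: v > 90.0)
print(anomalies)
```

Only the second interface averages above the 90 percent threshold, so it is the only one flagged.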
Below is a collection of readily available probes organized into rough categories.
They also serve as examples to help you build your own custom probes.

| Validation | Description |
|---|---|
| fabric_interface_flapping | Detect interfaces that are flapping |
| sfp | Detect high and/or low warning thresholds in SFP RX Power, TX Power, Temperature, Voltage, or Current |

| Validation | Description |
|---|---|
| mlag_domain_config_sanity_anomalies | Detect MLAG (a.k.a. MCLAG or VPC) domains with inconsistent configuration between member devices |
| stp_state | Detect STP blocked interfaces in all VLANs |
| stp_state_change | Detect any southbound interface STP state change seen in the last specified number of days |
| mlag_domain_state_anomalies | Detect MLAG domain state anomalies |

| Validation | Description |
|---|---|
| border_leaf_default_gateway_anomalies | Verify routing intent on border leafs by ensuring default gateway nexthop count is as expected in all VRFs |
| non_border_leaf_default_gateway_anomalies | Verify routing intent on non-border leafs by ensuring default gateway nexthop count is as expected in all VRFs |
| border_device_default_gateway_anomalies | Verify default routing intent on border devices (leaf or spine) ensuring the default gateway nexthop count is as expected for user specified VRFs |

| Validation | Description |
|---|---|
| fabric_ecmp_imbalance | Detect imbalance between interfaces within the Fabric |
| ecmp_imbalance_external_interfaces | Detect imbalance between interfaces exiting the Fabric |
| mlag_imbalance | Detect imbalance between member links of an aggregate MLAG link |
| drain_node_traffic_anomaly | Verify no application traffic on devices under maintenance |
| counters_error_anomalies | Detect Fabric interface showing alignment errors, FCS errors, runts, giants, or error packets |
| pkt_discard_anomalies | Detect Fabric interface having packet drops |
| interface_queue_drops_anomalies | Detect interfaces with overflowing buffers resulting in dropped packets |
| monitor_packet_loss | Detect high packet loss observed from pinging specified destination(s) |
| server_rtt | Detect high roundtrip time observed from pinging specified destination(s) |
| fabric_bgp_anomalies | Detects BGP anomalies in the fabric |
| interface_status_anomalies | Detects physical interface status |
| leaf_bgp_vrf_anomalies | Detects leaf vrf-aware BGP session anomalies |

| Validation | Description |
|---|---|
| fabric_hotcold_ifcounter | Detect hot and cold interfaces in the Fabric and flag systems with excessive cold or hot interfaces |
| eastwest_traffic | Show the distribution of north-south vs. east-west traffic in the Fabric |
| bandwidth_utilization_history | History of traffic patterns with varying degrees of aggregation |

| Validation | Description |
|---|---|
| Headroom | Calculate headroom between two servers along all available paths |

| Validation | Description |
|---|---|
| virtual_infra_vlan_match | Detect inconsistencies between physical underlay and virtual networking |
| missing_vlan_vms | Detect Virtual Machines that have connectivity issues due to configuration inconsistencies between physical underlay and virtual networking |
| virtual_infra_lag_match | Detect inconsistencies between physical underlay and hypervisor LAG configuration |
| virtual_infra_missing_lldp_config | Detect virtual infra hosts that are not configured for LLDP |
| virtual_infra_hypervisor_redundancy_checks | Detect single points of failure between hypervisor and physical underlay connectivity |
| hypervisor_mtu_checks | Detect hypervisor interfaces with MTU below threshold |
| hypervisor_mtu_mismatch | Detect MTU value deviation across hypervisor pnics |

| Validation | Description |
|---|---|
| arp_usage_anomalies | Detect devices with ARP table usage exceeding specified threshold |
| unicast_route_usage_anomalies | Detect devices with Unicast route table usage exceeding specified threshold |
| multicast_route_usage_anomalies | Detect devices with Multicast route table usage exceeding specified threshold |
| interface_bandwidth_anomalies | Detect interfaces with bandwidth exceeding specified threshold |

| Validation | Description |
|---|---|
| anycast_rp_anomalies | Verify rendezvous point intent by ensuring all systems designated as rendezvous points have the expected peers present |
| anycast_rp_peer_count_anomalies | Ensure high availability of rendezvous points by ensuring all rendezvous points have the minimum number of anycast peers configured |
| mroute_count_anomalies | Detect abnormal changes in number of multicast sources, groups or routes |
| pim_neighbor_anomalies | Verify Multicast intent by ensuring all SVI interfaces, leaf-spine and leaf-leaf links in every VRF have an expected PIM neighbor |
| pim_rp_anomalies | Verify every VRF in every switch in Fabric has expected rendezvous point IP configured |
| vrfs_missing_rp | Ensure any device acting as RP on any VRF is an RP for all other multicast enabled VRFs on that device |

| Validation | Description |
|---|---|
| memory_usage_threshold_anomalies | Detect memory leaks in specified process on all switches in the Fabric |
| power_supply_anomalies | Detect faults in power supply status, power supply fan status and power supply temperature status |
| system_memory_usage_threshold_anomalies | Detect switches having potential memory leaks in the Fabric |

| Validation | Description |
|---|---|
| hostname_compliance | Ensures the FQDN of any system matches the user specified regex |
| os_version_anomalies | Detect devices not running expected Operating System version |

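As a rough illustration of what a hostname compliance check involves, the sketch below tests a list of FQDNs against a regular expression. The function name, the pattern, and the hostnames are hypothetical, and requiring a full match (rather than a partial one) is an assumption; the actual probe's matching semantics are not described here.

```python
import re
from typing import List


def noncompliant_hostnames(fqdns: List[str], pattern: str) -> List[str]:
    """Return the FQDNs that do not fully match the given regex."""
    regex = re.compile(pattern)
    return [name for name in fqdns if regex.fullmatch(name) is None]


# Hypothetical naming convention: leafN or spineN inside dc1.example.com.
pattern = r"(leaf|spine)\d+\.dc1\.example\.com"
fqdns = [
    "leaf1.dc1.example.com",
    "spine2.dc1.example.com",
    "test-switch.lab",
]
bad = noncompliant_hostnames(fqdns, pattern)
print(bad)
```
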
| Validation | Description |
|---|---|
| bum_to_total_traffic_anomalies | Detect % of BUM traffic to overall traffic exceeding specified threshold |
| bum_traffic_on_unlearnt_vteps_anomaly | Detect when decap traffic is seen from an unlearnt remote VTEP |
| hardware_vtep_counters_enabled | Detect devices where hardware telemetry to capture VXLAN counters is not enabled |
| vxlan_status | Detect devices with VXLAN interface that is not up |
| static_vxlan_vtep_anomalies | Verify VXLAN intent by ensuring all L2 segments have expected VTEP flood list |
| vxlan_address_table_anomalies | Verify that the move count of any MAC address is not greater than 1 |

| Validation | Description |
|---|---|
| acl_stat_anomalies | Report on ACL rule matches that exceed user-defined thresholds |

| Validation | Description |
|---|---|
| evpn (VTEP floodlist) | Validate Type 3 routes for L2 VNIs |
| evpn (VXLAN Routing) | Validate VXLAN subnet (type 5) presence in BGP RIB |
| evpn (Floodlist limit) | Detect excessive per-VNI count of VTEPs in floodlist |
| evpn (VRF limit) | Detect excessive count of VRFs |

| Validation | Description |
|---|---|
| lag_link_fault_tolerance | Determines if a failure of one link in a server LAG can be tolerated based on the current traffic load |
| spine_fault_tolerance | Determines if the failure of a user specified number of spines can be tolerated based on the current traffic load across all spines |

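To make the fault-tolerance idea concrete, here is a deliberately simplified Python sketch. It assumes every spine has the same capacity and that traffic from failed spines redistributes evenly across the survivors; the function name and the numbers are hypothetical, and the actual probe's calculation may differ.

```python
from typing import List


def can_tolerate_spine_failures(
    spine_loads_gbps: List[float],
    spine_capacity_gbps: float,
    failures: int,
) -> bool:
    """Check whether the surviving spines can carry the current total load."""
    remaining = len(spine_loads_gbps) - failures
    if remaining <= 0:
        return False  # no spine left to carry traffic
    total_load = sum(spine_loads_gbps)
    return total_load <= remaining * spine_capacity_gbps


# Four spines, each carrying 250 Gbps of a made-up 400 Gbps capacity.
loads = [250.0, 250.0, 250.0, 250.0]
one_failure_ok = can_tolerate_spine_failures(loads, 400.0, failures=1)
two_failures_ok = can_tolerate_spine_failures(loads, 400.0, failures=2)
print(one_failure_ok, two_failures_ok)
```

With these numbers the total load is 1000 Gbps. Three surviving spines offer 1200 Gbps, so one failure is tolerable; two surviving spines offer only 800 Gbps, so two failures are not.
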
All the probes listed above are available as part of the AOS server predefined probe list or the aos-cli predefined probe list. For the former, use the AOS web UI to instantiate a predefined probe; you can find more details in the AOS documentation. For the latter, see the probe templates section below.
You can find working probes in the templates subfolder.
The files in this folder are IBA probe JSON payloads represented as
Jinja templates. Use aos-cli to load these probes onto the AOS server.
The command to use in aos-cli is:
probe create --blueprint <id> --file </path/to/template/file> [<additional_args>]
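To show what "a JSON payload represented as a template" means, the sketch below fills placeholders in a template string and parses the result as JSON. The real templates use Jinja and are rendered by aos-cli; to stay dependency-free, this example uses a small regex that handles only plain `{{ name }}` placeholders. The template text, its field names, and the `render` helper are hypothetical and do not reflect the actual probe schema.

```python
import json
import re

# Hypothetical template text; the real probe templates live in the
# templates subfolder and use full Jinja syntax.
TEMPLATE = '{"label": "{{ label }}", "threshold": {{ threshold }}}'


def render(template: str, variables: dict) -> str:
    """Replace each {{ name }} placeholder with the matching variable."""
    def substitute(match):
        return str(variables[match.group(1)])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


rendered = render(TEMPLATE, {"label": "demo_probe", "threshold": 80})
payload = json.loads(rendered)  # fails loudly if the result is not valid JSON
print(payload)
```

Parsing the rendered text with `json.loads` is a cheap way to catch a template that produces invalid JSON before it is sent anywhere.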
Authoring new probes requires familiarity with graph queries to select the elements of interest whose telemetry is ingested into the probes. You can find more details regarding graph queries in the AOS documentation.