[metrics]: Alert rules do not go into a firing state when the VPN tunnel is down, for VPN-tunnel and ALB, CLB, & NLB
rajualap opened this issue · 1 comment
Hi,
I have created a cloudwatch-exporter.yml file to fetch metrics from CloudWatch for RDS, Lambda, VPN-tunnel, ALB, CLB, and NLB. We are successfully obtaining metrics for RDS and Lambda and can see them in Prometheus, and when there is an issue with RDS or Lambda, the alert rules go into a firing state and generate alerts. However, we are not receiving alerts for the VPN-tunnel or for ALB, CLB, & NLB. Can you please help identify the reason? The cloudwatch-exporter.yml file and alert rules are below.
Please assist in resolving this issue.
cloudwatch-exporter.yml file here:
region: ap-south-1
metrics:
  - aws_namespace: AWS/RDS
    aws_metric_name: BurstBalance
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]
  - aws_namespace: AWS/RDS
    aws_metric_name: FreeableMemory
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]
  - aws_namespace: AWS/RDS
    aws_metric_name: CPUUtilization
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]
  - aws_namespace: AWS/RDS
    aws_metric_name: DatabaseConnections
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]
  - aws_namespace: AWS/Lambda
    aws_metric_name: Duration
    aws_dimensions: [FunctionName]
    aws_statistics: [Average]
  - aws_namespace: AWS/Lambda
    aws_metric_name: Errors
    aws_dimensions: [FunctionName]
    aws_statistics: [Sum]
  - aws_namespace: AWS/Lambda
    aws_metric_name: Invocations
    aws_dimensions: [FunctionName]
    aws_statistics: [Sum]
  - aws_namespace: AWS/ElasticLoadBalancing
    aws_metric_name: UnHealthyHostCount
    aws_dimensions: [LoadBalancerName]
    aws_statistics: [Average]
  - aws_namespace: AWS/ElasticLoadBalancing
    aws_metric_name: RequestCount
    aws_dimensions: [LoadBalancerName]
    aws_statistics: [Sum]
  - aws_namespace: AWS/VPN
    aws_metric_name: TunnelState
    aws_dimensions: [VpnId]
    aws_statistics: [Average]
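As a side note, AWS/ElasticLoadBalancing is the Classic Load Balancer namespace, so the two ELB entries above only match CLBs; ALBs and NLBs publish under AWS/ApplicationELB and AWS/NetworkELB with a LoadBalancer dimension. A sketch of what additional entries could look like (the metric names are examples from those namespaces, not a complete list):

  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: RequestCount
    aws_dimensions: [LoadBalancer]
    aws_statistics: [Sum]
  - aws_namespace: AWS/NetworkELB
    aws_metric_name: ActiveFlowCount
    aws_dimensions: [LoadBalancer]
    aws_statistics: [Average]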
####################################
Prometheus VPN-tunnel alerts file here:
groups:
  - name: VPNAlerts
    rules:
      # Alert if the average VPN tunnel state is less than 1 (indicating down) for 5 minutes
      - alert: VPNDownCritical
        expr: aws_vpn_tunnel_state_average < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          LABELS: '{{ $labels }}'
          VALUE: '{{ $value }}'
          summary: 'VPN Tunnel Down Critical'
          description: 'At least one VPN tunnel is down.'
      # Alert if the average VPN tunnel state is less than 1 for 1 minute
      - alert: VPNDownWarning
        expr: aws_vpn_tunnel_state_average < 1
        for: 1m
        labels:
          severity: warning
        annotations:
          LABELS: '{{ $labels }}'
          VALUE: '{{ $value }}'
          summary: 'VPN Tunnel Down Warning'
          description: 'At least one VPN tunnel is down.'
      # Alert if there are changes in VPN tunnel state indicating flapping for 5 minutes
      - alert: VPNFlapping
        expr: changes(aws_vpn_tunnel_state_average[5m]) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          LABELS: '{{ $labels }}'
          VALUE: '{{ $value }}'
          summary: 'VPN Tunnel Flapping'
          description: 'At least one VPN tunnel is experiencing flapping.'
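For what it's worth, a rules file like this can be syntax-checked with promtool before loading it into Prometheus (assuming it is saved as vpn-alerts.yml; the filename is just an example):

promtool check rules vpn-alerts.yml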
CloudWatch metrics screenshot here:
What does aws_vpn_tunnel_state_average look like at the /metrics endpoint? What does it look like in the Prometheus graph and table views?
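For reference, with the config above the exporter's /metrics output for that series would look roughly like this (the VPN ID and value are made up; the trailing number is the original CloudWatch timestamp in milliseconds, which the exporter attaches by default):

aws_vpn_tunnel_state_average{job="aws_vpn",instance="",vpn_id="vpn-0123456789abcdef0"} 1.0 1700000000000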
It seems that you are using the defaults for delay_seconds and set_timestamp. This means the metrics are not visible to an instant query at Prometheus's "now", which is what your rules use; see the documentation for details.
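One way to address that is a per-metric override, as a minimal sketch (assuming the exporter's set_timestamp option; note the value still lags CloudWatch by the default delay_seconds):

metrics:
  - aws_namespace: AWS/VPN
    aws_metric_name: TunnelState
    aws_dimensions: [VpnId]
    aws_statistics: [Average]
    # Without the original CloudWatch timestamp, Prometheus stamps the
    # sample at scrape time, so instant-query rules can see it.
    set_timestamp: false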
Try min_over_time(aws_vpn_tunnel_state_average[15m]) < 1 and changes(aws_vpn_tunnel_state_average[30m]) > 1 to look back further.
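Applied to the first rule above, that could look like the following sketch (the 15-minute window is illustrative):

groups:
  - name: VPNAlerts
    rules:
      - alert: VPNDownCritical
        # The range selector looks back far enough to include samples
        # that carry older CloudWatch timestamps.
        expr: min_over_time(aws_vpn_tunnel_state_average[15m]) < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'VPN Tunnel Down Critical'
          description: 'At least one VPN tunnel is down.'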