Agent telemetry filter_default configuration option not respected by prometheus interface
wolfmd opened this issue
Overview of the Issue
When calling the agent telemetry endpoint /v1/agent/metrics, metrics are returned in JSON format by default, or in Prometheus format when the format=prometheus query parameter is supplied. By default, all metrics described in the documentation are available in both formats.
However, when the prefix_filter option is set, the filter appears to apply only to the JSON view of the metrics. Similarly, filter_default has no effect on the Prometheus view.
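For reference, the documented filtering semantics are: entries in prefix_filter beginning with "+" allow a metric prefix, entries beginning with "-" block it, and filter_default decides whether metrics matching neither are recorded. A minimal sketch (the "-consul.http" entry is only illustrative):
consul agent -dev -hcl 'telemetry {
  filter_default = false                              # block anything matching no filter entry
  prefix_filter  = ["+consul.serf", "-consul.http"]   # "+" allows a prefix, "-" blocks it
}'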
Reproduction Steps
- Start an agent with no prefix_filter parameter
consul agent -dev -node localhost -client 127.0.0.1 -hcl 'telemetry { prometheus_retention_time = "10m" }'
- Check for a metric such as consul.serf in both outputs
root@mynode:/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/metrics | head
{
"Timestamp": "2024-10-15 23:02:40 +0000 UTC",
"Gauges": [
{
"Name": "consul.302com1.autopilot.failure_tolerance",
"Value": 0,
"Labels": {}
},
{
"Name": "consul.302com1.autopilot.healthy",
root@mynode:/state/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/metrics?format=prometheus | head
# HELP consul_302com1_autopilot_failure_tolerance consul_302com1_autopilot_failure_tolerance
# TYPE consul_302com1_autopilot_failure_tolerance gauge
consul_302com1_autopilot_failure_tolerance 0
# HELP consul_302com1_autopilot_healthy consul_302com1_autopilot_healthy
# TYPE consul_302com1_autopilot_healthy gauge
consul_302com1_autopilot_healthy 1
# HELP consul_302com1_cache_entries_count consul_302com1_cache_entries_count
# TYPE consul_302com1_cache_entries_count gauge
consul_302com1_cache_entries_count 1
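To compare the two views more systematically, something like the following rough sketch can be used (assumes jq is available; the comparison is only approximate, since the Prometheus encoding swaps dots for underscores and expands summaries into _sum/_count/quantile series):
curl -sS 127.0.0.1:8500/v1/agent/metrics \
  | jq -r '(.Gauges + .Points + .Counters + .Samples)[].Name' \
  | tr '.' '_' | sort -u > json_names.txt
curl -sS '127.0.0.1:8500/v1/agent/metrics?format=prometheus' \
  | awk '!/^#/ { sub(/\{.*/, "", $1); print $1 }' \
  | sort -u > prom_names.txt
comm -3 json_names.txt prom_names.txt   # names present in only one of the two outputs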
- Start an agent with a prefix_filter parameter, for example one that allows only consul.serf metrics
consul agent -dev -node localhost -client 127.0.0.1 -hcl 'telemetry { prometheus_retention_time = "10m", filter_default = false, prefix_filter = ["+consul.serf"] }'
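The same settings can also be expressed as a config file, in case the inline -hcl form matters (a sketch; the file name telemetry.hcl is arbitrary):
cat > telemetry.hcl <<'EOF'
telemetry {
  prometheus_retention_time = "10m"
  filter_default            = false
  prefix_filter             = ["+consul.serf"]
}
EOF
consul agent -dev -node localhost -client 127.0.0.1 -config-file telemetry.hcl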
- Confirm the configuration is in place on the agent
root@mynode:/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/self | jq -r '.DebugConfig.Telemetry'
{
"AllowedPrefixes": [],
"BlockedPrefixes": [
"consul.serf",
"consul.rpc.server.call"
],
...
"EnableHostMetrics": false,
"FilterDefault": false,
"MetricsPrefix": "consul",
- Check metrics on both the JSON and Prometheus interfaces: serf metrics are the only ones remaining in the JSON result, but the Prometheus output still contains other metrics
root@mynode:/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/metrics | head
{
"Timestamp": "2024-10-15 23:19:00 +0000 UTC",
"Gauges": [],
"Points": [],
"Counters": [],
"Samples": [
{
"Name": "consul.serf.queue.Event",
"Count": 1,
"Rate": 0.1,
root@mynode:/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/metrics?format=prometheus | head
# HELP consul_acl_ResolveToken This measures the time it takes to resolve an ACL token.
# TYPE consul_acl_ResolveToken summary
consul_acl_ResolveToken{quantile="0.5"} NaN
consul_acl_ResolveToken{quantile="0.9"} NaN
consul_acl_ResolveToken{quantile="0.99"} NaN
consul_acl_ResolveToken_sum 0
consul_acl_ResolveToken_count 0
# HELP consul_acl_authmethod_delete
# TYPE consul_acl_authmethod_delete summary
consul_acl_authmethod_delete{quantile="0.5"} NaN
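To quantify the leak, the following counts consul_* sample lines that are not serf metrics; if the filter were honored by the Prometheus view, this should print 0:
curl -sS '127.0.0.1:8500/v1/agent/metrics?format=prometheus' \
  | grep -v '^#' | grep '^consul_' | grep -vc '^consul_serf'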
Consul info for both Client and Server
The agent is running Consul 1.17.4. This can be reproduced with the agent in dev mode.
agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 0
build:
prerelease = dev
revision = 3e2302b+
version = 1.17.4
version_metadata =
consul:
acl = disabled
bootstrap = false
known_datacenters = 1
leader = true
leader_addr = 127.0.0.1:8300
server = true
raft:
applied_index = 64
commit_index = 64
fsm_pending = 0
last_contact = 0
last_log_index = 64
last_log_term = 2
last_snapshot_index = 0
last_snapshot_term = 0
latest_configuration = [{Suffrage:Voter ID:27ef875d-74af-30ff-1c7e-0ed5b987609b Address:127.0.0.1:8300}]
latest_configuration_index = 0
num_peers = 0
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 2
runtime:
arch = amd64
cpu_count = 96
goroutines = 186
max_procs = 96
os = linux
version = go1.22.5 X:boringcrypto
serf_lan:
coordinate_resets = 0
encrypted = false
event_queue = 1
event_time = 2
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1
members = 1
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1
members = 1
query_queue = 0
query_time = 1
Operating system and Environment details
Running on bare metal Debian
Similar behavior is also seen when setting prefix_filter = ["-consul"].
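For completeness, that variant in the same invocation style as above:
consul agent -dev -node localhost -client 127.0.0.1 -hcl 'telemetry { prometheus_retention_time = "10m", prefix_filter = ["-consul"] }'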
I'm not sure if I should note this here or in a new issue, but setting metrics_prefix to anything other than the default cuts the number of metrics exported in Prometheus format down dramatically. With the default metrics_prefix:
root@mynode:/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/metrics?format=prometheus | grep -v '#' | grep consul | wc -l
599
versus with metrics_prefix = "" set:
root@mynode:/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/metrics?format=prometheus | grep -v '#' | grep consul | wc -l
63
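One caveat I have not verified: with metrics_prefix = "" the metric names no longer begin with "consul", so the grep consul in the pipeline above will miss renamed metrics and undercount. Comparing raw sample-line counts sidesteps that bias:
curl -sS '127.0.0.1:8500/v1/agent/metrics?format=prometheus' | grep -vc '^#'   # count all sample lines, regardless of name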