hashicorp/consul

Agent telemetry filter_default configuration option not respected by prometheus interface

wolfmd opened this issue · 2 comments

Overview of the Issue

When calling the agent telemetry endpoint /v1/agent/metrics, metrics are returned in JSON format by default, or in Prometheus format when the query parameter format=prometheus is supplied. By default, all metrics described in the documentation are available in both formats.

However, if the prefix_filter option is set, it only appears to apply to the JSON view of the metrics. Similarly, filter_default has no effect on the Prometheus view of the metrics.
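For reference, this is a minimal sketch (not Consul's actual implementation) of how the documented prefix_filter / filter_default semantics are expected to behave for any output format: entries prefixed with "+" allow a metric prefix, "-" blocks one, and filter_default decides metrics that match neither list.

```python
def parse_prefix_filter(rules):
    """Split prefix_filter entries into allow and block prefix lists."""
    allowed, blocked = [], []
    for rule in rules:
        if rule.startswith("+"):
            allowed.append(rule[1:])
        elif rule.startswith("-"):
            blocked.append(rule[1:])
    return allowed, blocked

def is_emitted(name, allowed, blocked, filter_default=True):
    """Longest matching prefix wins; otherwise fall back to filter_default."""
    match = max((p for p in allowed + blocked if name.startswith(p)),
                key=len, default=None)
    if match is None:
        return filter_default
    return match in allowed

# With filter_default = false and prefix_filter = ["+consul.serf"],
# only consul.serf.* metrics should survive in *both* views:
allowed, blocked = parse_prefix_filter(["+consul.serf"])
print(is_emitted("consul.serf.queue.Event", allowed, blocked, False))   # True
print(is_emitted("consul.autopilot.healthy", allowed, blocked, False))  # False
```

The bug is that the Prometheus endpoint behaves as if this filtering step is skipped entirely.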

Reproduction Steps

  1. Start an agent with no prefix_filter parameter
consul agent -dev -node localhost -client 127.0.0.1 -hcl 'telemetry { prometheus_retention_time = "10m" }'
  2. Check for a metric such as consul.serf in both outputs
root@mynode:/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/metrics | head
{
    "Timestamp": "2024-10-15 23:02:40 +0000 UTC",
    "Gauges": [
        {
            "Name": "consul.302com1.autopilot.failure_tolerance",
            "Value": 0,
            "Labels": {}
        },
        {
            "Name": "consul.302com1.autopilot.healthy",
root@mynode:/state/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/metrics?format=prometheus | head
# HELP consul_302com1_autopilot_failure_tolerance consul_302com1_autopilot_failure_tolerance
# TYPE consul_302com1_autopilot_failure_tolerance gauge
consul_302com1_autopilot_failure_tolerance 0
# HELP consul_302com1_autopilot_healthy consul_302com1_autopilot_healthy
# TYPE consul_302com1_autopilot_healthy gauge
consul_302com1_autopilot_healthy 1
# HELP consul_302com1_cache_entries_count consul_302com1_cache_entries_count
# TYPE consul_302com1_cache_entries_count gauge
consul_302com1_cache_entries_count 1
  3. Start an agent with a prefix_filter parameter that keeps only consul.serf metrics
    consul agent -dev -node localhost -client 127.0.0.1 -hcl 'telemetry { prometheus_retention_time = "10m", filter_default = false, prefix_filter = ["+consul.serf"] }'

  4. Confirm the configuration is in place on the agent

root@mynode:/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/self | jq -r '.DebugConfig.Telemetry'
{
  "AllowedPrefixes": [],
  "BlockedPrefixes": [
    "consul.serf",
    "consul.rpc.server.call"
  ],
...
  "EnableHostMetrics": false,
  "FilterDefault": false,
  "MetricsPrefix": "consul",
  5. Check metrics on both interfaces: serf metrics are the only ones remaining in the JSON result, but the Prometheus result still contains other metrics
root@mynode:/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/metrics | head
{
    "Timestamp": "2024-10-15 23:19:00 +0000 UTC",
    "Gauges": [],
    "Points": [],
    "Counters": [],
    "Samples": [
        {
            "Name": "consul.serf.queue.Event",
            "Count": 1,
            "Rate": 0.1,
root@mynode:/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/metrics?format=prometheus | head
# HELP consul_acl_ResolveToken This measures the time it takes to resolve an ACL token.
# TYPE consul_acl_ResolveToken summary
consul_acl_ResolveToken{quantile="0.5"} NaN
consul_acl_ResolveToken{quantile="0.9"} NaN
consul_acl_ResolveToken{quantile="0.99"} NaN
consul_acl_ResolveToken_sum 0
consul_acl_ResolveToken_count 0
# HELP consul_acl_authmethod_delete
# TYPE consul_acl_authmethod_delete summary
consul_acl_authmethod_delete{quantile="0.5"} NaN

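To make the discrepancy between steps easy to see, here is a hypothetical helper (not part of Consul or its tooling) that extracts the set of metric names from each of the two outputs above, so the JSON and Prometheus views can be diffed directly:

```python
import json
import re

def json_metric_names(payload):
    """Collect metric names from the JSON view of /v1/agent/metrics."""
    data = json.loads(payload)
    names = set()
    for kind in ("Gauges", "Points", "Counters", "Samples"):
        for metric in data.get(kind, []):
            names.add(metric["Name"])
    return names

def prometheus_metric_names(text):
    """Collect metric names from the Prometheus text view, skipping comments."""
    names = set()
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        # Drop labels/quantiles and the value; keep only the metric name.
        names.add(re.split(r"[{ ]", line, maxsplit=1)[0])
    return names
```

Feeding each view's curl output through the matching function and diffing the two sets shows names like consul_acl_ResolveToken surviving only on the Prometheus side once the filter is in place.
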
Consul info for both Client and Server

The agent is running Consul 1.17.4. This can be reproduced in agent dev mode.

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = dev
	revision = 3e2302b+
	version = 1.17.4
	version_metadata =
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 127.0.0.1:8300
	server = true
raft:
	applied_index = 64
	commit_index = 64
	fsm_pending = 0
	last_contact = 0
	last_log_index = 64
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:27ef875d-74af-30ff-1c7e-0ed5b987609b Address:127.0.0.1:8300}]
	latest_configuration_index = 0
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 2
runtime:
	arch = amd64
	cpu_count = 96
	goroutines = 186
	max_procs = 96
	os = linux
	version = go1.22.5 X:boringcrypto
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 1
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details

Running on bare metal Debian

Also note that similar behavior is seen when setting prefix_filter = ["-consul"].

I'm not sure whether to note this here or in a new issue, but setting metrics_prefix to anything other than the default dramatically cuts the number of metrics exported in Prometheus format:

root@mynode:/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/metrics?format=prometheus | grep -v '#' | grep consul | wc -l
599

vs, when metrics_prefix = "" is set:

root@mynode:/home/wolfmd# curl -sS 127.0.0.1:8500/v1/agent/metrics?format=prometheus | grep -v '#' | grep consul | wc -l
63