grafana/k8s-monitoring-helm

Correct/best way to drop metrics?

tcborg opened this issue · 2 comments

Hello,

I am trying to drop metrics, but I am unsure whether I am hitting some ceiling or limit here. After a few metrics, the drop configuration stops working. For instance, in the snippet below, every metric listed after the go_gc_duration_seconds_count|grafana_kubernetes_monitoring_build_info rule fails to drop.

Also, is there a way to drop all metrics except a chosen few? I only need a handful.

metrics:
  extraMetricRelabelingRules: |-
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex         = "kube_namespace_status_phase|container_cpu_cfs_periods_total|agent_build_info|container_cpu_cfs_periods_total|container_cpu_cfs_throttled_periods_total"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex         = "container_fs_reads_bytes_total|container_fs_reads_total|container_fs_writes_bytes_total|container_fs_writes_total|container_gpu_allocation"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex         = "container_memory_cache|container_memory_rss|container_memory_swap|container_memory_working_set_bytes|container_network_receive_bytes_total"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex         = "container_network_receive_packets_dropped_total|container_network_receive_packets_total|container_network_transmit_bytes_total|container_network_transmit_packets_dropped_total|container_network_transmit_packets_total"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex         = "deployment_match_labels|go_gc_duration_seconds|go_goroutines|go_memstats_alloc_bytes|go_memstats_alloc_bytes_total|go_gc_duration_seconds_sum|go_info|go_memstats_buck_hash_sys_bytes"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex         = "go_memstats_frees_total|go_memstats_gc_cpu_fraction|go_memstats_gc_sys_bytes|go_memstats_heap_alloc_bytes|go_memstats_heap_idle_bytes|go_memstats_heap_inuse_bytes|go_memstats_heap_objects|go_memstats_heap_released_bytes"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex         = "go_memstats_heap_sys_bytes|go_memstats_last_gc_time_seconds|go_memstats_lookups_total|go_memstats_mallocs_total|go_memstats_mcache_inuse_bytes|go_memstats_mcache_sys_bytes|go_memstats_mspan_inuse_bytes|go_memstats_mspan_sys_bytes"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex         = "go_memstats_next_gc_bytes|go_memstats_other_sys_bytes|go_memstats_stack_inuse_bytes|go_memstats_stack_sys_bytes|go_memstats_sys_bytes|go_threads"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex         = "kube_daemonset_created|kube_daemonset_labels|kube_daemonset_metadata_generation|kube_daemonset_status_current_number_scheduled|kube_daemonset_status_desired_number_scheduled|kube_daemonset_status_number_available|kube_daemonset_status_number_misscheduled|kube_daemonset_status_number_ready|kube_daemonset_status_number_unavailable|kube_daemonset_status_observed_generation"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex         = "kube_daemonset_status_updated_number_scheduled|kube_deployment_created|kube_deployment_labels|kube_deployment_metadata_generation|kube_deployment_status_observed_generation|kube_deployment_status_replicas|kube_deployment_status_replicas_available|kube_deployment_status_replicas_unavailable|kube_deployment_status_updated_replicas"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex         = "kube_job_created|kube_job_labels|kube_job_metadata_generation|kube_job_status_active|kube_job_status_completion_time|kube_job_status_failed|kube_job_status_start_time|kube_job_status_succeeded"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex = "go_gc_duration_seconds_count|grafana_kubernetes_monitoring_build_info"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex = "kubelet_pleg_relist_interval_seconds_bucket"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex = "kubelet_pod_start_duration_seconds_bucket|kubelet_pod_start_duration_seconds_count|kubelet_pod_start_duration_seconds_sum|kubelet_runtime_operations_duration_seconds_bucket|kubelet_runtime_operations_duration_seconds_count|kubelet_runtime_operations_duration_seconds_sum|kubelet_server_expiration_renew_errors|kubelet_volume_stats_available_bytes|kubelet_volume_stats_inodes"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex = "kubelet_running_containers|kubelet_running_pods|kubelet_running_pods_per_container|kubelet_running_pods_per_pod|kubelet_running_pods_per_pod_container|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_used|kubernetes_build_info|process_heap_bytes|process_virtual_memory_bytes|promhttp_metric_handler_requests_in_flight|pv_hourly_cost|volume_manager_total_volumes"
    }
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex = "namespace_cpu:kube_pod_container_resource_requests:sum|opencost_build_info|pod_pvc_allocation|process_cpu_seconds_total|process_max_fds|process_open_fds|process_resident_memory_bytes|process_start_time_seconds|process_cpu_system_seconds_total|process_cpu_user_seconds_total|process_virtual_memory_max_bytes|promhttp_metric_handler_requests_total"
    }
skl commented

@tcborg take a look at this Custom Metrics Tuning example and see if it helps:
https://github.com/grafana/k8s-monitoring-helm/tree/main/examples/custom-metrics-tuning

I don't think there's any limit to the number of rules. I'll check with the Alloy team.

In the meantime, is there a reason you wouldn't combine the metrics into a single list? Alternatively, you could use the built-in metrics management functionality, as @skl mentioned.

For example, instead of setting:

metrics:
  extraMetricRelabelingRules: |-
    rule {
      action        = "drop"
      source_labels = ["__name__"]
      regex = "kubelet_pleg_relist_interval_seconds_bucket"
    }

which adds that rule to every metric processing component and unnecessarily increases CPU load, you can just do this:

metrics:
  kubelet:
    metricsTuning:
      excludeMetrics: [kubelet_pleg_relist_interval_seconds_bucket]
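
On the earlier question about dropping everything except a few metrics, here is a minimal sketch. It assumes the relabel rules accept the Prometheus-style "keep" action, the inverse of "drop". The two metric names in the regex are placeholders for illustration, not a recommended set. Because extraMetricRelabelingRules is added to every metric processing component, a rule like this would discard every series whose name is not listed, from every source.

```yaml
metrics:
  extraMetricRelabelingRules: |-
    // Keep only series whose metric name fully matches one of the
    // alternatives below; all other series are dropped.
    // The names are placeholders; substitute the metrics you need.
    rule {
      action        = "keep"
      source_labels = ["__name__"]
      regex         = "kube_pod_status_phase|container_cpu_usage_seconds_total"
    }
```

Relabel regexes are anchored to the whole value, so go_gc_duration_seconds does not also match go_gc_duration_seconds_count; each name has to be listed explicitly or covered by a pattern such as go_gc_duration_seconds.*.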