Move detailed per CPU kernel stats under a new flag `METRICS_V2_KERNEL_COUNTERS_PER_CPU`

Question

Closed this issue 7 months ago · 6 comments

The new detailed per-CPU kernel counters are great. At the same time, they can add a lot of metrics, especially on beefy servers with 128+ CPUs. I therefore propose moving them under a new flag, e.g. `METRICS_V2_KERNEL_COUNTERS_PER_CPU`, and exposing a sub-config key in Falco so that users can opt in to or out of these new counters.

CC @Andreagit97 WDYT?

If we all agree, we should prioritize this for 0.18.0 so that we have a coherent UX.
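
To illustrate the opt-in behavior proposed above, here is a minimal Python sketch. It is not the libs implementation: the bit positions, the `collect_kernel_metrics` helper, and the metric names (`n_drops`, `n_drops_cpu_N`) are all hypothetical and only show how a separate per-CPU flag could gate the extra series.

```python
from enum import IntFlag


class MetricsFlags(IntFlag):
    # Bit positions are illustrative assumptions, not the values used in libs.
    KERNEL_COUNTERS = 1 << 0
    KERNEL_COUNTERS_PER_CPU = 1 << 1


def collect_kernel_metrics(flags, per_cpu_drops):
    """Return {metric name: value}; per-CPU series appear only when opted in."""
    metrics = {}
    if flags & MetricsFlags.KERNEL_COUNTERS:
        # Aggregate counter: one series regardless of the CPU count.
        metrics["n_drops"] = sum(per_cpu_drops)
    if flags & MetricsFlags.KERNEL_COUNTERS_PER_CPU:
        # Detailed counters: one series per CPU (hypothetical naming).
        for cpu, drops in enumerate(per_cpu_drops):
            metrics[f"n_drops_cpu_{cpu}"] = drops
    return metrics


drops = [3, 0, 7, 1]  # made-up drop counts for a 4-CPU machine
default_metrics = collect_kernel_metrics(MetricsFlags.KERNEL_COUNTERS, drops)
detailed_metrics = collect_kernel_metrics(
    MetricsFlags.KERNEL_COUNTERS | MetricsFlags.KERNEL_COUNTERS_PER_CPU, drops
)
```

With only the aggregate flag set, a single series is produced; adding the per-CPU flag adds one series per CPU, which is the growth the separate flag is meant to keep optional.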

Answer 1 · 2024-08-28T02:34:23.000Z

/milestone 0.18.0

Answer 2 · 2024-08-28T08:26:29.000Z

Yes, it makes sense. I will take care of it! Thank you for the suggestion.

Answer 3 · 2024-08-28T16:47:33.000Z

Thank you Andrea!

A few more thoughts:

So far we have always enabled all metrics categories in Falco, while metrics themselves are disabled by default. For this metric category, I was thinking we should disable it by default. What are your thoughts?
We could also consider exposing statistical metrics in the future, such as the kurtosis or skewness of these counters, whenever a snapshot is taken in libsinsp. In some use cases, you might prefer to opt into receiving only these statistical metrics instead of all the raw counter fields. Other use cases may still require the raw counters. Happy to help with this if we all agree it's useful. Mentioning it now in case it is useful to shape the design.
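
As a rough idea of what such statistical metrics could look like, the sketch below computes the skewness and excess kurtosis of one counter across CPUs from a single snapshot. It is an assumption-laden illustration in Python, not libsinsp code: the function name is hypothetical, population (biased) moment estimators are used, and returning `(0.0, 0.0)` when all CPUs report the same value is a convention picked here because both quantities are undefined at zero variance.

```python
def skewness_and_excess_kurtosis(values):
    """Population skewness and excess kurtosis of per-CPU counter values."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        # All CPUs report the same value; both statistics are undefined,
        # so this sketch returns zeros by convention.
        return 0.0, 0.0
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


balanced = [10, 10, 10, 10]  # made-up counts: load spread evenly
one_hot_cpu = [0, 0, 0, 100]  # made-up counts: one CPU takes all events
balanced_stats = skewness_and_excess_kurtosis(balanced)
hot_stats = skewness_and_excess_kurtosis(one_hot_cpu)
```

A clearly positive skewness, as in the second example, would hint that a few CPUs account for most of the events without shipping every raw per-CPU field.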

Answer 4 · 2024-08-29T08:44:16.000Z

> So far we have always enabled all metrics categories in Falco, while metrics themselves are disabled by default. For this metric category, I was thinking we should disable it by default. What are your thoughts?

Yep, I agree.

> We could also consider exposing statistical metrics in the future, such as the kurtosis or skewness of these counters, whenever a snapshot is taken in libsinsp. In some use cases, you might prefer to opt into receiving only these statistical metrics instead of all the raw counter fields. Other use cases may still require the raw counters. Happy to help with this if we all agree it's useful. Mentioning it now in case it is useful to shape the design.

Yeah, it seems like a great idea, but I'm not sure where the right place to compute them is. I'm definitely not an expert in metrics representation, but isn't it possible to derive this kind of data directly in Prometheus, or something similar, starting from the metrics we expose today?

Answer 5 · 2024-08-30T03:23:41.000Z

> We could also consider exposing statistical metrics in the future, such as the kurtosis or skewness of these counters, whenever a snapshot is taken in libsinsp. In some use cases, you might prefer to opt into receiving only these statistical metrics instead of all the raw counter fields. Other use cases may still require the raw counters. Happy to help with this if we all agree it's useful. Mentioning it now in case it is useful to shape the design.

> Yeah, it seems like a great idea, but I'm not sure where the right place to compute them is. I'm definitely not an expert in metrics representation, but isn't it possible to derive this kind of data directly in Prometheus, or something similar, starting from the metrics we expose today?

Off the top of my head, one advantage would be saving the space needed to send many raw metric fields on machines with many CPUs.
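
A back-of-the-envelope comparison of that saving, as a small Python sketch. Only the 128-CPU figure comes from this thread; the number of per-CPU counters and the number of summary statistics per counter are hypothetical placeholders.

```python
def series_count_raw(n_cpus, counters_per_cpu):
    """Series emitted when every per-CPU counter is exported as-is."""
    return n_cpus * counters_per_cpu


def series_count_summary(counters_per_cpu, stats_per_counter):
    """Series emitted when each counter is reduced to a few statistics."""
    return counters_per_cpu * stats_per_counter


n_cpus = 128             # server size mentioned earlier in this thread
counters_per_cpu = 2     # hypothetical number of per-CPU counters
stats_per_counter = 2    # hypothetical, e.g. skewness and kurtosis

raw_series = series_count_raw(n_cpus, counters_per_cpu)
summary_series = series_count_summary(counters_per_cpu, stats_per_counter)
```

The raw count grows linearly with the number of CPUs, while the summary count stays constant under these assumptions.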

Answer 6 · 2024-08-30T07:49:49.000Z

That's a good point. I see these per-CPU metrics more as a debug option than as something to run in production by default. If we end up using them, we are probably in a critical situation with performance issues, so the overhead they introduce could be acceptable since we are trying to debug. So yes, statistical metrics are useful for sure, but I'm not 100% sure this is the right case to apply them. WDYT?