kubedl-io/kubedl

[feature request] I would like to monitor the kubedl_jobs_failed metric, but the label only supports kind and does not allow retrieving the jobName. The experience with exposed metrics is very unsatisfactory.

13241308289 opened this issue · 3 comments

I would like to monitor the kubedl_jobs_failed metric, but the only job-related label it carries is kind, so there is no way to retrieve the jobName of the job that failed. This makes the exposed metrics very unsatisfactory to work with.

For example:

kubedl_jobs_failed{endpoint="metrics", instance="", job="kubedl", kind="marsjob", namespace="kubedl-system", pod="", service="kubedl"}


@13241308289 Hi, thanks for the feedback! The reason we didn't initially include this label was due to the limited capacity of Prometheus's data backend, which doesn't actively purge data that's been stored for an extended period. We assessed that it might not be well-suited for job scenarios. However, it seems user experience is also quite significant, so let's go ahead and add it. Would you be interested in contributing to this?

I took another look at the code, and it turns out that this metric is initialized at the controller layer, which is why it's not possible to expose the jobName label. I believe that if we want to expose specific labels like jobName, we should adopt an implementation similar to kubedl_jobs_first_pod_launch_delay_seconds. Of course, I would be very happy to implement this, as my business also has this requirement.
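
To make the idea concrete, here is a minimal, dependency-free sketch of a per-job failure counter. It is not the KubeDL code: the type `FailedJobs`, its methods, and the label names `job_namespace` and `job_name` are hypothetical. They are chosen only so they would not clash with the `namespace` and `job` target labels visible in the sample above. A real change would presumably use a labelled counter vector from the Prometheus Go client instead of a map. The `Forget` method reflects the cardinality concern raised earlier: the series for a job is dropped once the job is gone.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// jobKey identifies one time series: the existing "kind" label plus the
// per-job labels requested in this issue (label names are assumptions).
type jobKey struct {
	Kind      string
	Namespace string
	JobName   string
}

// FailedJobs is a hypothetical stand-in for a labelled counter vector.
type FailedJobs struct {
	mu     sync.Mutex
	counts map[jobKey]float64
}

// NewFailedJobs returns an empty counter set.
func NewFailedJobs() *FailedJobs {
	return &FailedJobs{counts: make(map[jobKey]float64)}
}

// Inc records one failure for a specific job.
func (f *FailedJobs) Inc(kind, namespace, jobName string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.counts[jobKey{kind, namespace, jobName}]++
}

// Get returns the failure count for a job, or 0 if no series exists.
func (f *FailedJobs) Get(kind, namespace, jobName string) float64 {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.counts[jobKey{kind, namespace, jobName}]
}

// Forget drops the series for a job (for example when the job object is
// deleted) so the number of series does not grow without bound.
// It reports whether a series existed.
func (f *FailedJobs) Forget(kind, namespace, jobName string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	k := jobKey{kind, namespace, jobName}
	_, ok := f.counts[k]
	delete(f.counts, k)
	return ok
}

// Expose renders the current series as sorted, exposition-style lines.
func (f *FailedJobs) Expose() []string {
	f.mu.Lock()
	defer f.mu.Unlock()
	lines := make([]string, 0, len(f.counts))
	for k, v := range f.counts {
		lines = append(lines, fmt.Sprintf(
			"kubedl_jobs_failed{kind=%q,job_namespace=%q,job_name=%q} %g",
			k.Kind, k.Namespace, k.JobName, v))
	}
	sort.Strings(lines)
	return lines
}

func main() {
	failed := NewFailedJobs()
	failed.Inc("marsjob", "default", "mars-demo") // job name is made up
	for _, line := range failed.Expose() {
		fmt.Println(line)
	}
	failed.Forget("marsjob", "default", "mars-demo")
}
```

With per-job labels like these, an alert can name the failing job directly, at the cost of one series per job until it is forgotten.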

Thanks for your contribution!