RFE/Question: Using unique worker ID in place of `pidhash` label

Question

Closed this issue 4 years ago · 5 comments

Hi, thank you for the project. I can see that in #8 we started using pid_hash, derived from the PID and start time of the process. This can lead to a Prometheus metric explosion (see the caution in https://prometheus.io/docs/practices/naming/#labels). Could we use something like a "worker id" as the label instead? That way, we only have a finite number of label values. Happy to contribute a PR as well.

Additional labels can be used to distinguish between multiple different instances of an application or across applications.

Answer 1 · 2020-03-11T01:14:20.000Z

@amitsaha I agree that could, in theory, lead to a metric explosion. In practice, we don't have enough churn for this to occur.

For this discussion, I assume you refer to the process ID when you say "worker id".

I tried to recall the rationale behind pid_hash, and I believe it was to avoid duplicate worker IDs. For example, if a pool runs long enough or (more likely) the container is restarted, overlapping process IDs will be reported to Prometheus. In that case you can no longer distinguish request counts between workers, e.g. worker X1 (PID 33) has 50k requests and one day later worker X2 (PID 33) has 3k requests.
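
To illustrate the point, here is a minimal Python sketch of a label value derived from PID and start time. The helper name `pid_hash`, the use of SHA-256, the truncation to 12 characters, and the start times are all assumptions for illustration, not the exporter's actual implementation:

```python
import hashlib


def pid_hash(pid: int, start_time: int) -> str:
    """Hypothetical helper: derive a label value from PID and process start time."""
    raw = f"{pid}:{start_time}".encode()
    # Truncated digest keeps the label short; the length is an arbitrary choice.
    return hashlib.sha256(raw).hexdigest()[:12]


# Same PID reused one day later (both start times are made up).
worker_x1 = pid_hash(33, 1583880000)
worker_x2 = pid_hash(33, 1583966400)
```

Because the start time differs, the two workers get different label values even though they share PID 33, so their request counters stay separate series.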

Suggestions for other solutions to this problem are welcome!

Answer 2 · 2020-03-11T01:57:33.000Z

Hi @estahn thanks for getting back on this.

Regarding the worker ID, I meant something like this: each worker process, irrespective of its process ID, is one of the N workers at any given point in time. So if we assign a worker ID (1, 2, ..., N) to each worker, we can use that as an identifier. Then we have a stable, finite set of unique label values. However, as you mention, that idea breaks for counter metrics. It would be fine for non-counter metrics, of course, since there we are dealing with observations. However, maybe we don't need counter metrics, since Prometheus automatically gives us the count for each summary/histogram metric. What do you think?
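
A rough Python sketch of the cardinality difference, under simplified assumptions: every restart replaces all workers, PIDs are never reused, and a `(pid, generation)` tuple stands in for the hash of PID and start time. The function name is hypothetical:

```python
def label_values_after_churn(num_workers: int, restarts: int) -> tuple[int, int]:
    """Count distinct label values seen with pid_hash-style vs. worker-ID labels."""
    pid_hash_labels = set()
    worker_id_labels = set()
    next_pid = 1
    for generation in range(restarts + 1):
        for slot in range(1, num_workers + 1):
            # Stand-in for a hash of PID + start time: unique per process.
            pid_hash_labels.add((next_pid, generation))
            # Worker ID is just the slot number, reused across restarts.
            worker_id_labels.add(slot)
            next_pid += 1
    return len(pid_hash_labels), len(worker_id_labels)


pid_hash_count, worker_id_count = label_values_after_churn(num_workers=5, restarts=100)
```

With 5 workers and 100 full restarts, the pid_hash-style label takes 505 distinct values while the worker-ID label stays at 5.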

Answer 3 · 2020-03-11T05:11:28.000Z

@amitsaha I see what you mean. In this case, how would you determine the number of killed/restarted processes over time? Is there a way to count counter resets?
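
For reference, PromQL has a `resets()` function that counts counter resets within a range vector by treating any decrease between consecutive samples as a reset. A minimal Python sketch of the same idea, with a hypothetical function name and made-up sample values:

```python
def count_resets(samples: list[float]) -> int:
    """Count how often a counter series decreases between consecutive samples."""
    resets = 0
    for previous, current in zip(samples, samples[1:]):
        if current < previous:
            # A counter only goes down when the process behind it was replaced.
            resets += 1
    return resets


# Requests counter for one worker ID; it drops twice, so two restarts are visible.
observed = count_resets([10, 50, 3, 20, 1, 2])
```

One caveat with this approach: a restart goes unnoticed if the new process's counter has already passed the old value by the next scrape.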

Answer 4 · 2020-03-13T08:02:28.000Z

@amitsaha After thinking about this a little, another question came to mind. Assuming we implement this solution, how would you assign a worker ID (or index)? I see a few issues with keeping track of these indices, e.g.

Index 1: Process ID 55
Index 2: Process ID 70
Index 3: Process ID 99

If the process "Process ID 70" gets killed and a new process is forked (e.g. Process ID 130) would that be Index 2 or something else?
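
One possible answer, sketched in Python under the assumption that a freed index is handed to the next new process (lowest free index first). The class `WorkerSlots` and its `sync` method are hypothetical names for illustration, not part of the exporter:

```python
class WorkerSlots:
    """Hypothetical tracker that maps live PIDs to small, reusable indices."""

    def __init__(self) -> None:
        self._index_by_pid: dict[int, int] = {}

    def sync(self, live_pids: list[int]) -> dict[int, int]:
        live = set(live_pids)
        # Release the indices of processes that are no longer running.
        for pid in list(self._index_by_pid):
            if pid not in live:
                del self._index_by_pid[pid]
        used = set(self._index_by_pid.values())
        # Give each new process the lowest index that is currently free.
        for pid in sorted(live - set(self._index_by_pid)):
            index = 1
            while index in used:
                index += 1
            self._index_by_pid[pid] = index
            used.add(index)
        return dict(self._index_by_pid)


slots = WorkerSlots()
before = slots.sync([55, 70, 99])
after = slots.sync([55, 99, 130])  # PID 70 was killed, PID 130 was forked
```

Under this policy PID 130 takes over Index 2, so the label set stays at three values, at the cost of the counter for Index 2 resetting when the process changes.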

Answer 5 · 2020-05-13T03:41:13.000Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.