[Core] Change the source of the ray_tasks metric for finished or failed tasks to have a more accurate count.

Question

Opened this issue a month ago · 1 comment

What happened + What you expected to happen

Proposal: keep a global counter of num_finished and num_failed tasks on the head node and export the ray_tasks metric from it.

The current distributed counter approach runs into problems when a node dies: that node's count of total finished or failed tasks gets wiped out.
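To illustrate the failure mode, here is a toy simulation. It is not Ray's actual metrics code: the classes PerNodeCounters and HeadCounter, the simulate helper, and the node ids are all hypothetical. It only shows how a total summed over live per-node counters drops when a node dies, while a counter kept in one place does not.

```python
from collections import Counter


class PerNodeCounters:
    """Hypothetical model: each node keeps its own finished-task count."""

    def __init__(self):
        self.counts = Counter()

    def record_finished(self, node_id):
        self.counts[node_id] += 1

    def node_died(self, node_id):
        # The dead node's counter disappears together with the node.
        self.counts.pop(node_id, None)

    def total(self):
        return sum(self.counts.values())


class HeadCounter:
    """Hypothetical model: one counter kept on the head node."""

    def __init__(self):
        self.finished = 0

    def record_finished(self, node_id):
        self.finished += 1

    def node_died(self, node_id):
        # Nothing to lose: the count does not live on the worker node.
        pass

    def total(self):
        return self.finished


def simulate(counter):
    # 100 finished tasks, split evenly across two nodes, then one node dies.
    for i in range(100):
        counter.record_finished("node-a" if i % 2 == 0 else "node-b")
    counter.node_died("node-b")
    return counter.total()


distributed_total = simulate(PerNodeCounters())
global_total = simulate(HeadCounter())
print(distributed_total, global_total)  # prints: 50 100
```

In this model the distributed total silently loses the 50 tasks that finished on the dead node, which is the inaccuracy described above.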

We worked around this in the Grafana dashboard by applying max_over_time to each of these counts, but that can be very slow because it scans the past 14 days of time-series data.
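The exact dashboard query is not shown in this issue, so the following is only a simplified Python model of the workaround, with made-up sample data and hypothetical helper names (latest_value, max_over_window). It shows why taking the maximum over a window recovers a dead node's last count, and why the cost grows with the number of samples in the window.

```python
# Hypothetical per-node series of (timestamp, cumulative finished count).
series = {
    "node-a": [(t, 10 * t) for t in range(1, 11)],  # still reporting at t=10
    "node-b": [(t, 5 * t) for t in range(1, 6)],    # died after t=5
}


def latest_value(samples, now):
    """Value reported exactly at `now`, or None if the series has gone stale."""
    for t, v in samples:
        if t == now:
            return v
    return None


def max_over_window(samples, now, window):
    """Largest value among samples with now - window <= t <= now (simplified)."""
    in_window = [v for t, v in samples if now - window <= t <= now]
    return (max(in_window) if in_window else None), len(in_window)


now, window = 10, 10

# Summing only the current values drops the dead node entirely.
instant_total = sum(
    v for v in (latest_value(s, now) for s in series.values()) if v is not None
)

# Summing the per-series maximum over the window keeps the dead node's
# last count, but every sample in the window has to be read.
windowed_total = 0
samples_scanned = 0
for samples in series.values():
    value, scanned = max_over_window(samples, now, window)
    samples_scanned += scanned
    if value is not None:
        windowed_total += value

print(instant_total, windowed_total, samples_scanned)  # prints: 100 125 15
```

With a 14-day window and many series, the number of samples scanned is what makes the real query slow. A single counter on the head node would avoid the window scan altogether.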

Versions / Dependencies

ray 2.21.0

Reproduction script

simple repro:

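import ray

# Connect to an already running cluster.
ray.init(address="auto")

# A trivial task; each call adds one finished task to the metric.
@ray.remote
def foo():
    return "hi"

# Launch 100 tasks and wait for all of them to finish.
ray.get([foo.remote() for _ in range(100)])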

Open the Grafana dashboard, go to the metrics page, and look at the tasks graph. If the number of tasks is very large and the cluster has been alive for a long time, this graph can be too slow to load at all.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Answer 1 · 2024-05-20T21:57:06.000Z

Do you have a PR for this?