rapidsai/dask-cuda

Add a cluster/worker option to log cuDF spilling statistics

charlesbluca opened this issue · 1 comments

Chatting with @quasiben, it seems that in addition to #1226 it would be useful to have a (preferably machine-parseable) way to track cuDF spilling statistics beyond the dashboard page. In our initial conversations, a potential implementation looked like an option/argument for the worker/cluster APIs that enables logging of cuDF spilling during and/or after computation:

$ CUDF_SPILL=on CUDF_SPILL_STATS=1 dask cuda worker --cudf-spill-logging tcp://10.33.227.163:8786
2023-09-29 07:36:11,333 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.33.227.163:38751'
...
2023-09-29 07:40:14,483 - distributed.worker - INFO - Worker tcp://10.33.227.163:45905 spilled 24 bytes from GPU in 0.01s
2023-09-29 07:40:14,483 - distributed.worker - INFO - Worker tcp://10.33.227.163:45905 unspilled 24 bytes to GPU in 0.01s
...
2023-09-29 07:36:14,483 - distributed.worker - INFO - -------------------------------------------------
2023-09-29 07:36:14,483 - distributed.worker - INFO -                Worker: tcp://10.33.227.163:45905
2023-09-29 07:36:14,483 - distributed.worker - INFO -         Bytes spilled:                        24
2023-09-29 07:36:14,483 - distributed.worker - INFO -   Time spent spilling:                     0.02s
2023-09-29 07:36:14,483 - distributed.worker - INFO - -------------------------------------------------
2023-09-29 07:40:38,126 - distributed.nanny - INFO - Worker process 3868728 was killed by signal 9

I imagine this could look like a worker plugin that polls the cuDF spilling statistics periodically and at worker closing time (is there a way we could "subscribe" a worker to cuDF spilling events?), but I'm interested in whether there's a better approach we could take here.
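
To make the polling idea concrete, here is a minimal sketch of the bookkeeping such a plugin might do. Everything in it is hypothetical: the `SpillStatsPoller` class, the `get_totals` callable, and the `(src, dst) -> (bytes, seconds)` shape of the totals are assumptions for illustration, not an existing dask-cuda or cuDF API. It uses only the standard library so it can be run without a GPU.

```python
import logging
import threading
from typing import Callable, Dict, Optional, Tuple

# Assumed shape of cumulative spill totals (hypothetical):
# (source, destination) -> (total bytes moved, total seconds spent)
SpillTotals = Dict[Tuple[str, str], Tuple[int, float]]

logger = logging.getLogger("spill_stats_sketch")


class SpillStatsPoller:
    """Periodically log the change in cumulative spill totals (sketch only)."""

    def __init__(
        self,
        get_totals: Callable[[], SpillTotals],  # hypothetical stats source
        worker_address: str,
        interval: float = 1.0,
    ) -> None:
        self._get_totals = get_totals
        self._address = worker_address
        self._interval = interval
        self._last: SpillTotals = {}
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def poll_once(self) -> SpillTotals:
        """Log and return what changed since the previous poll."""
        current = dict(self._get_totals())
        deltas: SpillTotals = {}
        for key, (nbytes, seconds) in current.items():
            prev_bytes, prev_seconds = self._last.get(key, (0, 0.0))
            if nbytes != prev_bytes:
                deltas[key] = (nbytes - prev_bytes, seconds - prev_seconds)
                logger.info(
                    "Worker %s moved %d bytes %s -> %s in %.2fs",
                    self._address, deltas[key][0], key[0], key[1], deltas[key][1],
                )
        self._last = current
        return deltas

    def _run(self) -> None:
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval):
            self.poll_once()

    def start(self) -> None:
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        """Stop polling, take a final reading, and log cumulative totals."""
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
        self.poll_once()
        for (src, dst), (nbytes, seconds) in self._last.items():
            logger.info(
                "Worker %s total %s -> %s: %d bytes in %.2fs",
                self._address, src, dst, nbytes, seconds,
            )
```

In a real worker plugin, `start()` and `stop()` would presumably be called from the `setup` and `teardown` hooks of a `distributed` `WorkerPlugin`, and `get_totals` would read whatever cumulative statistics cuDF exposes when `CUDF_SPILL_STATS` is enabled. Polling only sees totals, so several spill events between two polls would be reported as one aggregated line rather than individually.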


I don't think we have a "proper" way of doing something like this; the closest is probably the LoggerBuffer interface, which could be plugged into a PeriodicCallback as suggested in #442 (comment). Other than that, I don't think there are any pre-baked solutions for this, but I agree this could be a useful feature.