Dashboard for cuDF spilling
pentschev opened this issue · 3 comments
When cuDF spilling is enabled, we may eventually hit allocation errors once 100% of the memory is unspillable. The information about cuDF spilling memory usage is very useful, but failures may be delayed to a point where that information no longer helps us understand how we got there.
With that said, a dashboard that lets us look at both cuDF's device memory consumption and spilled (host) memory usage in real time would help us understand where we have pressure. Being able to see the history of cuDF memory usage would be a definite bonus, as it would let us see what happened even if workers die abruptly.
Note: to get statistics for cuDF spilling, the user needs to set an environment flag/config option:
https://docs.rapids.ai/api/cudf/stable/developer_guide/library_design/#statistics
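For reference, a minimal sketch of what setting that flag could look like when preparing a worker's environment. The variable names `CUDF_SPILL` and `CUDF_SPILL_STATS` are my reading of the docs linked above, so please double-check them against your cuDF version; `cudf_spill_env` is a hypothetical helper, not part of cuDF or Dask.

```python
import os

# Assumed names, based on my reading of the linked cuDF docs:
#   CUDF_SPILL=on         enables spilling
#   CUDF_SPILL_STATS=<n>  0 = off, 1 = spill statistics, 2 = more detailed tracking
def cudf_spill_env(stats_level=1, base=None):
    """Hypothetical helper: copy `base` (default: os.environ) and enable spill stats."""
    if stats_level not in (0, 1, 2):
        raise ValueError("stats_level must be 0, 1, or 2")
    env = dict(os.environ if base is None else base)
    env["CUDF_SPILL"] = "on"
    env["CUDF_SPILL_STATS"] = str(stats_level)
    return env

# Build an environment to pass to a worker subprocess (e.g. via subprocess.Popen(env=...)).
env = cudf_spill_env(stats_level=1, base={})
print(env)
```

The variables need to be in place before `cudf` is imported in the worker process. As far as I know the in-process equivalent is `cudf.set_option("spill", True)` together with `cudf.set_option("spill_stats", 1)`, but that is not exercised in the sketch above.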
This was maybe closed by dask/distributed#8148?
I think so too. I'll close this, and we can open a new issue if anything else is missing. Thanks for checking on this, @TomAugspurger!