rapidsai/dask-cuda

Failing tests on `distributed>2023.9.2`

pentschev opened this issue · 3 comments

Two tests are currently failing after removing the distributed==2023.9.2 pin:

FAILED tests/test_local_cuda_cluster.py::test_pre_import_not_found - Failed: DID NOT RAISE <class 'RuntimeError'>
FAILED tests/test_local_cuda_cluster.py::test_death_timeout_raises - Failed: DID NOT RAISE <class 'asyncio.exceptions.TimeoutError'>

We need to bisect to find the source of those regressions, but for now we're xfailing them to allow unpinning going through.

wence- commented

FAILED tests/test_local_cuda_cluster.py::test_pre_import_not_found

This one seems to be because scaling up a speccluster now swallows any errors when scaling up. (dask/distributed#8309)

wence- commented

FAILED tests/test_local_cuda_cluster.py::test_death_timeout_raises

This is the same cause.

Thanks for digging into that @wence- .

I'm sure this is not the first time this is problematic. This is definitely not something we need to do right now, but I'm wondering if we should do a pass on Dask-CUDA's tests that could be generalized and submit them as part of Distributed's test set to prevent regressions there.