rapidsai/cloud-ml-examples

22.08 nightly container does not launch Dask scheduler properly

Closed this issue · 5 comments

hcho3 commented

The EC2 MNMG notebook currently uses the stable 21.06 container (rapidsai/rapidsai:21.06-cuda11.0-runtime-ubuntu18.04-py3.8) and is able to launch the EC2 cluster successfully.

However, when I replace it with the latest nightly container (rapidsai/rapidsai-nightly:22.08-cuda11.5-runtime-ubuntu20.04-py3.9), the EC2 cluster fails to launch. For some reason, the container fails to start the Dask scheduler on port 8786. (I waited more than 3 hours and the scheduler still didn't come up on 8786.)

TODO. Investigate why python -m distributed.cli.dask_scheduler fails on the latest nightly container.
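For anyone reproducing this, a bounded port check is quicker than waiting for hours. The following is only a minimal sketch using the standard library; `wait_for_port` is a hypothetical helper name, and the default timeout and polling interval are arbitrary assumptions, not values taken from the notebook.

```python
import socket
import time


def wait_for_port(host: str, port: int = 8786,
                  timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before `timeout` seconds
    have elapsed. 8786 is the scheduler port mentioned in this issue.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: wait a little and try again.
            time.sleep(interval)
    return False
```

Calling `wait_for_port("<scheduler-ip>")` returns `False` after about five minutes if nothing is listening, which only shows that the port is closed. It does not say why the scheduler did not start.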

I wonder if the environment variable DISABLE_JUPYTER needs to be set to true. The RAPIDS Docker image might not be starting Dask at all if it is just blocking on Jupyter as the foreground process.

cluster = EC2Cluster(env_vars={"DISABLE_JUPYTER": "true", **get_aws_credentials()},
                     ...

xref rapidsai/docker#425, but that change was made in January, so I'm surprised we aren't seeing these issues in 22.06 too.

hcho3 commented

The current notebook uses 21.06. When I switched to 22.06, I got the same issue.

hcho3 commented

Indeed, after setting DISABLE_JUPYTER=true, I observe the Dask scheduler launching successfully. I will incorporate this in my pull request. Thanks!
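For reference, here is a rough sketch of what the workaround might look like end to end. The helper names `build_env_vars` and `launch_cluster` are hypothetical, the `n_workers` value is an arbitrary assumption, and the import paths assume the `dask_cloudprovider` package layout. The actual change in the pull request may differ.

```python
def build_env_vars(credentials: dict) -> dict:
    """Merge DISABLE_JUPYTER=true with the AWS credentials.

    Mirrors the dict from the snippet earlier in this thread. The
    credentials are spread last, as in that snippet.
    """
    return {"DISABLE_JUPYTER": "true", **credentials}


def launch_cluster(n_workers: int = 2):
    """Hypothetical wrapper around EC2Cluster; requires AWS access to run."""
    # Imported lazily so build_env_vars can be used without dask_cloudprovider.
    from dask_cloudprovider.aws import EC2Cluster
    from dask_cloudprovider.aws.helper import get_aws_credentials

    return EC2Cluster(
        # Image tag taken from the issue description above.
        docker_image="rapidsai/rapidsai-nightly:22.08-cuda11.5-runtime-ubuntu20.04-py3.9",
        env_vars=build_env_vars(get_aws_credentials()),
        n_workers=n_workers,  # assumed value, not from the notebook
    )
```

Only `env_vars` and the image tag come from this thread. Any other arguments the notebook passes to `EC2Cluster` are not shown here.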

Ah yup, I misread your initial comment as 22.06, but if we are upgrading from 21.06 that makes a lot of sense.

@hcho3 just going through old issues, can this be closed out now?