microsoft/PlanetaryComputerExamples

Dask scheduler version with custom Docker image

derekrollend opened this issue · 7 comments

Hello,

I am attempting to run a modified version of the land cover classification tutorial, where I'd like to load my own segmentation_models_pytorch model that was trained on Sentinel-2 imagery. We have some custom functions we need from our own pip package, and we are using the latest segmentation_models_pytorch built from source. Thus, I have built my own Docker image based on the pangeo-docker-images/pangeo-notebook image (tag 2021.05.04), which uses both dask==2021.04.1 and distributed==2021.04.1. I have verified those versions are correct within a container.

I point the cluster to this image using options['image'] = 'drollend/apl-gpu-pytorch-notebook:latest', but I receive the following VersionMismatchWarning from the distributed client:

+-------------+-----------+-----------+-----------+
| Package     | client    | scheduler | workers   |
+-------------+-----------+-----------+-----------+
| dask        | 2021.04.1 | 2021.05.0 | 2021.04.1 |
| distributed | 2021.04.1 | 2021.05.0 | 2021.04.1 |
+-------------+-----------+-----------+-----------+

I am unsure of where the scheduler lives and why it is a different version. Any pointers you can provide on fixing the scheduler version are much appreciated. Thanks in advance and thanks for creating the Planetary Computer!

Thanks for trying things out. If you're working on the Hub at https://pccompute.westeurope.cloudapp.azure.com/compute/hub/spawn and using Dask Gateway, there are three pods involved:

  1. The user pod, where your notebook is running. You don't currently have much control over the version of Dask used here.
  2. The scheduler pod. Created by Dask Gateway using options['image'].
  3. The worker pod. Created by Dask Gateway using options['image'].

I'm a bit surprised by the message there. Since the scheduler and worker pods are both created from options['image'], I would have expected them to report the same version, with the client at 2021.04.1. Can you verify that the table is correct? Can you also run something like this to confirm the versions on the workers and the scheduler?

import dask

client.run(lambda: dask.__version__)               # dask version on each worker
client.run_on_scheduler(lambda: dask.__version__)  # dask version on the scheduler
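Alternatively, distributed can gather the same comparison in one call. A rough sketch (the exact layout of the returned dict can vary a little between releases):

# Same data the mismatch warning is built from; check=True would raise on a mismatch.
versions = client.get_versions()
versions["scheduler"]["packages"]["dask"]
{w: v["packages"]["dask"] for w, v in versions["workers"].items()}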

Thanks for the quick response. Here is the output of those two lines after waiting for the two workers to spin up:

client.run(lambda: dask.__version__)

{'tls://10.244.39.3:32933': '2021.04.1', 'tls://10.244.40.3:35205': '2021.04.1'}

client.run_on_scheduler(lambda: dask.__version__)

'2021.05.0'

Can you share a full example of creating a cluster and connecting a client to it? Here's mine, with my singleuser pod running on https://pccompute.westeurope.cloudapp.azure.com/ and pangeo/base-notebook:2021.05.04 as the image option.

from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()

# The scheduler and worker pods are both created from this image
options['image'] = "pangeo/base-notebook:2021.05.04"
cluster = gateway.new_cluster(options)
client = cluster.get_client()

That prints out the warning.

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py:1140: VersionMismatchWarning: Mismatched versions found

+-------------+-----------+-----------+---------+
| Package     | client    | scheduler | workers |
+-------------+-----------+-----------+---------+
| blosc       | 1.10.2    | None      | None    |
| dask        | 2021.04.1 | 2021.05.0 | None    |
| distributed | 2021.04.1 | 2021.05.0 | None    |
| lz4         | 3.1.3     | None      | None    |
+-------------+-----------+-----------+---------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))

A few notes:

  1. It's sometimes OK to ignore those warnings. Dask does allow some flexibility between the versions installed locally and on the scheduler & workers. That said, recent changes in dask and distributed have been tightly coupled, so for now you may need exact pinning. (If you decide the mismatch is benign, see the sketch after this list for silencing the warning.)
  2. The version of Dask on the singleuser node (where the client is running) is somewhat difficult for you to control right now. You may be better off running your client locally (on your laptop / desktop) and connecting to our Dask Gateway from outside of the hub: https://planetarycomputer.microsoft.com/docs/concepts/computing/#use-our-dask-gateway. It's not shown there, but you can use your custom Docker image locally to ensure the versions match.
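On point 1: if you do conclude the mismatch is benign for your workload, the warning itself can be filtered. A minimal sketch, assuming the warning class is importable from distributed.versions in these releases:

import warnings
from distributed.versions import VersionMismatchWarning

# Silence only the version-mismatch warning; other warnings still surface.
warnings.simplefilter("ignore", VersionMismatchWarning)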

When I try to follow the #use-our-dask-gateway example, I can't seem to get past cluster.get_client() due to the error below. I don't see any standard way to increase the timeout, nor am I sure if that's the problem. Have you encountered this error before?

OSError: Timed out trying to connect to gateway://pccompute-dask.westeurope.cloudapp.azure.com/prod.a394972afe8f40caab667edec204026e after 10 s

Monitoring gateway.list_clusters() from a notebook in the hub, it looks like the cluster starts up, but then our local client can't connect. Same thing happened on a DMZ machine. Maybe there's a firewall/networking issue on the hub side?
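The check from a notebook in the hub was essentially this (a rough sketch):

from dask_gateway import Gateway

# Inside the hub, Gateway() picks up the default gateway configuration.
gateway = Gateway()
gateway.list_clusters()  # reports the cluster as RUNNING while the external client times out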

ClusterReport<name=prod.b7329799b6d24592b43013be468cdf11, status=RUNNING>
OSError: Timed out trying to connect to gateway://pccompute-dask.westeurope.cloudapp.azure.com:443/prod.b7329799b6d24592b43013be468cdf11 after 10 s

Huh, thanks for sharing that. I'll take a look at what's going on.

Apologies for the delay on this. I just got around to debugging it, and I think it's as simple as needing to include port 80 in the proxy address. So the connection would be

>>> gateway = Gateway(
...     "https://pccompute.westeurope.cloudapp.azure.com/compute/services/dask-gateway/",
...     proxy_address="gateway://pccompute-dask.westeurope.cloudapp.azure.com:80",
...     auth=jupyterhub_auth,
... )

rather than

>>> gateway = Gateway(
...     "https://pccompute.westeurope.cloudapp.azure.com/compute/services/dask-gateway/",
...     proxy_address="gateway://pccompute-dask.westeurope.cloudapp.azure.com",
...     auth=jupyterhub_auth,
... )

I'll get https://planetarycomputer.microsoft.com/docs/concepts/computing/#use-our-dask-gateway updated.
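In the meantime, here's a rough end-to-end sketch of the external connection with the corrected proxy address. I'm assuming JupyterHubAuth with a Hub API token as the auth mechanism (as in the linked docs); the token value and the image name are placeholders, and whatever image you use just needs to match the dask/distributed versions installed in your local environment:

from dask_gateway import Gateway
from dask_gateway.auth import JupyterHubAuth

# Placeholder token: generate one from the Hub's token page.
auth = JupyterHubAuth(api_token="<your-jupyterhub-api-token>")

gateway = Gateway(
    "https://pccompute.westeurope.cloudapp.azure.com/compute/services/dask-gateway/",
    proxy_address="gateway://pccompute-dask.westeurope.cloudapp.azure.com:80",
    auth=auth,
)

options = gateway.cluster_options()
# Example custom image; the scheduler and worker pods are both created from it.
options["image"] = "drollend/apl-gpu-pytorch-notebook:latest"

cluster = gateway.new_cluster(options)
client = cluster.get_client()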