High memory usage on proximity notebook
TomAugspurger opened this issue · 4 comments
The cell
extent_data = data.sel(band="extent")
extent_proximity_default = proximity(extent_data).compute()
is currently failing on staging because the workers are using too much memory. The notebook output includes
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:39389 -> tcp://127.0.0.1:36541
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/comm/tcp.py", line 198, in read
frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed
Here's a reproducer with just xrspatial, dask, and xarray
import dask.array as da
import xarray as xr
from xrspatial.proximity import proximity

a = xr.DataArray(
    da.ones((5405, 5766), dtype="float64", chunks=(3000, 3000)),
    dims=("y", "x"),
)
proximity(a).compute()
cc @thuydotm, does this look like an issue in xrspatial? Or do you think it might be upstream in dask?
I'm taking a look now. My initial thought is that when max_distance goes to infinity, each chunk expands to cover the whole array (each chunk can be expanded to be even bigger than the original data array), which would easily cause memory issues. Do you think it would be better to change the API so that providing max_distance is always required?
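For context, a hedged sketch of the workaround this implies: assuming the installed xrspatial version accepts a max_distance keyword on proximity (as the discussion here suggests), passing a finite value bounds how far each chunk has to look into its neighbours. The value 500 below is an arbitrary placeholder, not taken from the notebook.

import dask.array as da
import xarray as xr
from xrspatial.proximity import proximity

a = xr.DataArray(
    da.ones((5405, 5766), dtype="float64", chunks=(3000, 3000)),
    dims=("y", "x"),
)

# Assumption: max_distance is supported here and is interpreted in the same
# distance units as the raster. A finite value keeps the halo each chunk
# needs from its neighbours bounded instead of spanning the whole array.
result = proximity(a, max_distance=500).compute()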
I'm not sure. API-wise, it seems like max_distance=None is equivalent to max_distance=<whatever distance is larger than the shape of the array>, which I think could be determined by the function itself.
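A minimal sketch of what "determined by the function itself" could look like, assuming the default is derived purely from the raster's extent. The helper name and the diagonal-length choice are illustrative, not xrspatial's actual behaviour.

import numpy as np
import xarray as xr

def default_max_distance(raster: xr.DataArray) -> float:
    # Hypothetical helper: any distance at least as long as the raster's
    # diagonal reaches every cell, so it behaves like "no limit" without
    # being infinite. Units are cells here; a real implementation would
    # scale by the coordinate spacing.
    height, width = raster.shape[-2:]
    return float(np.hypot(height, width))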
I forgot to mention that in one of the later cells there was an error from the overlap depth being larger than the shape of the array along that dimension. If the user provides a max_distance that's larger than the shape of the array, I would expect proximity to truncate the distance.
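One way that truncation could look, sketched as a hypothetical helper (the name clamp_overlap_depth and its use are illustrative, not the actual change in the PR below):

def clamp_overlap_depth(depth, shape):
    # Hypothetical sketch: never request an overlap halo deeper than the
    # array's extent along that axis, which is what the error above hit
    # when a large max_distance implied a huge depth.
    return tuple(min(int(d), int(s)) for d, s in zip(depth, shape))

# Example: a requested depth of (10_000, 10_000) on a (5405, 5766) array
# is reduced to (5405, 5766).
print(clamp_overlap_depth((10_000, 10_000), (5405, 5766)))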
Thanks, that makes sense. I'll open a PR in xrspatial to address this.
Closed by makepath/xarray-spatial#558