xarray-contrib/xoak

Grabbing too much data when using a particular server, using `xoak.sel`


kthyng commented

Hi! I am hitting a memory problem that I think is due to the behavior described in the docs for xoak.sel: "This triggers dask.compute() if the given indexers and/or the index coordinates are chunked." Is there any way to work around that behavior?

Here is an example to see what I mean:

import xarray as xr
import xoak

# Open the remote aggregated forecast lazily (chunks={} keeps the arrays dask-backed)
loc = "https://opendap.co-ops.nos.noaa.gov/thredds/dodsC/CIOFS/fmrc/Aggregated_7_day_CIOFS_Fields_Forecast_best.ncd"
ds = xr.open_dataset(loc, drop_variables=["ocean_time", "time_run"], chunks={})
var = ds["temp"]

# A single location to look up on the curvilinear grid
lon, lat = -151, 59
ds_to_find = xr.Dataset(
    {
        "lat_to_find": ("locs", [lat], {"standard_name": "latitude"}),
        "lon_to_find": ("locs", [lon], {"standard_name": "longitude"}),
    }
)

# Build a ball-tree index on the grid coordinates and select the nearest point
var.xoak.set_index(["lat_rho", "lon_rho"], "sklearn_geo_balltree")
output = var.xoak.sel(
    {"lat_rho": ds_to_find.lat_to_find, "lon_rho": ds_to_find.lon_to_find}
)

# Loading only the surface layer still grabs far more data than expected
output.isel(s_rho=-1).load()

Thank you!

Hmm, I'm not sure exactly what is happening, but Xoak's support for chunked datasets (either the index or the indexer) is very experimental and fairly unstable. For example, dask/distributed rebuilds or replicates the same indexes on different workers, which may lead to memory issues (I've tried to improve that without much success).
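To see which code path you are on, you can check whether the index coordinates are actually chunked, since that is the condition under which xoak.sel triggers dask.compute(). A minimal sketch, using a small in-memory dataset with illustrative names standing in for the CIOFS grid:

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the CIOFS grid (sizes and names are illustrative).
ds = xr.Dataset(
    {"temp": (("eta_rho", "xi_rho"), np.zeros((4, 5)))},
    coords={
        "lat_rho": (("eta_rho", "xi_rho"), np.zeros((4, 5))),
        "lon_rho": (("eta_rho", "xi_rho"), np.zeros((4, 5))),
    },
)

# In-memory coordinates report no chunks ...
print(ds["lat_rho"].chunks)  # None

# ... while chunking the dataset also chunks the index coordinates,
# which is the condition that makes xoak.sel call dask.compute().
chunked = ds.chunk({"eta_rho": 2})
print(chunked["lat_rho"].chunks)  # ((2, 2), (5,))
```

With `chunks={}` in `open_dataset`, the coordinates come back dask-backed, so the chunked path is taken even though each array is a single chunk.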

Did you look at the dask/distributed diagnostics to see what's going on?
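For the distributed scheduler, the place to look is the dashboard (`client.dashboard_link` after creating a `dask.distributed.Client`). For the default local scheduler, dask's built-in Profiler gives a rough equivalent; a sketch with a toy chunked computation standing in for `output.isel(s_rho=-1).load()`:

```python
import dask.array as da
from dask.diagnostics import Profiler

# Toy chunked computation; the Profiler records every task the scheduler
# runs, which shows how much of the graph (and hence how much remote data)
# a .load() actually touches.
x = da.ones((1000, 1000), chunks=(250, 250))

with Profiler() as prof:
    total = x.sum().compute()

print(f"{len(prof.results)} tasks executed, sum = {total}")
```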

If possible, it would be better to first load the whole dataset into memory, or at least a region of interest (ROI) containing all of the locations to find.
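The ROI approach can be sketched with plain xarray, no xoak required: mask the 2-D coordinates around the target location, trim with `where(..., drop=True)`, and then `.load()` that subset before building the index. The grid below is a hypothetical stand-in for `lat_rho`/`lon_rho`:

```python
import numpy as np
import xarray as xr

# Hypothetical curvilinear grid standing in for the CIOFS lat_rho/lon_rho
# (sizes and values are illustrative only).
lon2d, lat2d = np.meshgrid(np.linspace(-153, -149, 5), np.linspace(57, 61, 4))
ds = xr.Dataset(
    {"temp": (("eta_rho", "xi_rho"), np.arange(20.0).reshape(4, 5))},
    coords={
        "lat_rho": (("eta_rho", "xi_rho"), lat2d),
        "lon_rho": (("eta_rho", "xi_rho"), lon2d),
    },
)

# Mask grid cells within a padded box around the target location (-151, 59).
lon, lat, pad = -151, 59, 1.0
mask = (
    (ds.lat_rho > lat - pad) & (ds.lat_rho < lat + pad)
    & (ds.lon_rho > lon - pad) & (ds.lon_rho < lon + pad)
)

# drop=True trims to the bounding box of the mask; for a remote dataset,
# calling .load() on this subset first keeps the xoak indexing in memory.
roi = ds.where(mask, drop=True)
print(dict(roi.sizes))
```

The padding should be generous enough that the nearest grid point to every query location falls inside the box.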