bucket resampler get_max, get_min are not optimally parallelised
gerritholl opened this issue · 1 comment
Despite impressive improvements in #368, there remains room to improve the parallelisation of the BucketResampler methods `get_max` and `get_min`. Much of the runtime is spent in the `@dask.delayed`-decorated function `_get_statistics`, which is not parallelised, even though it could be in principle. Perhaps the `resample_blocks` function (to be) introduced in #341 could be of help.
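To illustrate the general problem (a minimal sketch, not pyresample code): wrapping a whole-array computation in a single `@dask.delayed` call produces one task, so all chunks are funnelled through one worker, whereas a chunked reduction lets dask schedule work across workers. The helper `whole_array_max` below is hypothetical and only stands in for a monolithic delayed function like `_get_statistics`.

```python
import dask
import dask.array as da
import numpy as np

@dask.delayed
def whole_array_max(arr):
    # One task for the whole array: dask must first assemble all
    # chunks, then run this function serially on a single worker.
    return np.asarray(arr).max()

x = da.arange(1_000_000, chunks=100_000)

serial = whole_array_max(x)   # a single task in the graph
parallel = x.max()            # a tree reduction over 10 chunks

# Both give the same answer; only the second parallelises.
assert dask.compute(serial)[0] == parallel.compute()
```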
Code Sample, a minimal, complete, and verifiable piece of code
"""Test bucket resampler performance."""
from pyresample import create_area_def
from pyresample.bucket import BucketResampler
import dask.array as da
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler, visualize
import numpy as np
from math import prod
import xarray as xr
src_area = create_area_def(
"src", 4087, area_extent=[-3_200_000, -3_200_000, 3_200_000, 3_200_000], resolution=1000)
dest_area = create_area_def(
"dest", 4087, area_extent=[-3_000_000, -3_000_000, 3_000_000, 3_000_000], resolution=5000)
data = xr.DataArray(
da.arange(prod(src_area.shape)).reshape(src_area.shape)-10000,
dims=("y", "x"),
attrs={"area": src_area})
resampler = BucketResampler(dest_area, *src_area.get_lonlats(chunks=1000))
m = resampler.get_max(data)
with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
dask.compute(m)
visualize([prof, rprof], show=False, save=True, filename="br-max-dask.html")
Problem description
The bucket resampler methods `get_max` and `get_min` spend most of the wall-clock time in unparallelised code. This means the run takes longer than it needs to.
Expected Output
I expect a dask visualisation that illustrates that 800% CPU is used 100% of the time.
Actual Result, Traceback if applicable
In reality, 800% CPU is used less than 40% of the time. More than 60% of the time is spent in the task `_get_statistics`, which is exactly the `@dask.delayed`-decorated function used to calculate the maximum.
Versions of Python, package at hand and relevant dependencies
pyresample main: v1.23.0-46-g0cb8914
The key difficulty in the algorithm, as far as I understand it, is that it needs to do an `argsort`, and it does this over the entire input data array. The reason the `resample_blocks` function might help is that it runs your function on a single sub-array of the input data that covers the output chunk/block. This is really what the algorithm "wants" in its current implementation, but coding up that logic is complicated in a basic `map_blocks` call. We can let `resample_blocks` do that for us in the future.
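For reference, the core reduction a bucket-max has to perform can be sketched as a scatter-max with `np.ufunc.at`, which needs no global `argsort` at all. This is only an illustration of the per-bucket reduction; the `bucket_max` helper below is hypothetical and is not how pyresample's `_get_statistics` is actually implemented.

```python
import numpy as np

def bucket_max(values, bucket_idx, n_buckets, fill=np.nan):
    """Maximum of all values falling into each target bucket.

    values: 1-D array of source pixel values.
    bucket_idx: flat target-bucket index for each source pixel.
    """
    out = np.full(n_buckets, -np.inf)
    np.maximum.at(out, bucket_idx, values)  # in-place scatter max
    out[np.isinf(out)] = fill               # buckets that got no data
    return out

vals = np.array([1.0, 5.0, 3.0, 2.0, 4.0])
idx = np.array([0, 0, 1, 1, 3])  # target bucket for each value
print(bucket_max(vals, idx, 4))  # bucket 2 receives no data → NaN
```

If a `resample_blocks`-style wrapper handed each output chunk only the source pixels that map into it, a reduction like this could run independently per output block.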