Rechunk memory usage seems to exceed the `max_mem` limit
Hi there!
I'm trying out xarray-beam for some Zarr rechunking, and I'm wondering if anyone has tips on controlling the `max_mem` used by the rechunk operation.
I first tried rechunking a 128 GB CMIP6 dataset and hit an OOM error at around 120 GB of RAM.
Next, I tried rechunking a much smaller dataset (~10 GB); memory consumption climbed to about 34 GB before the job finished.
*(Screenshot: completing the 10 GB Zarr store, with high memory usage.)*
I'm basing my rechunking pipeline on the ERA5_rechunking.py example in the docs, using the local runner and the default `max_mem` of 1 GB.
Small dataset example:
```python
import apache_beam as beam
import xarray_beam as xbeam

store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/pangeo-forge/EOBS-feedstock/eobs-wind-speed.zarr'
# store = 'https://cpdataeuwest.blob.core.windows.net/cp-cmip/version1/data/DeepSD/ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.DeepSD.pr.zarr'
output_store = 'eobs_test.zarr'

# Lazily open the source store and build a template for the output.
source_dataset, source_chunks = xbeam.open_zarr(store)
template = xbeam.make_template(source_dataset)
target_chunks = {'time': 80, 'latitude': 350, 'longitude': 511}

# Largest dtype itemsize across variables, used by Rechunk to size chunks.
itemsize = max(variable.dtype.itemsize for variable in template.values())

with beam.Pipeline(
    runner="DirectRunner",
    options=beam.pipeline.PipelineOptions(["--num_workers", "2"]),
) as root:
    (
        root
        | xbeam.DatasetToChunks(source_dataset, source_chunks, split_vars=True)
        | xbeam.Rechunk(
            source_dataset.sizes,
            source_chunks,
            target_chunks,
            itemsize=itemsize,
        )
        | xbeam.ChunksToZarr(output_store, template, zarr_chunks=target_chunks)
    )
```
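For completeness, `Rechunk` also accepts the memory budget explicitly via its `max_mem` argument (the 1 GB default is what I'm relying on above); a minimal sketch, with an illustrative value I haven't actually tuned:

```python
# Sketch only: pass max_mem explicitly instead of relying on the 1 GB default.
rechunk = xbeam.Rechunk(
    source_dataset.sizes,
    source_chunks,
    target_chunks,
    itemsize=itemsize,
    max_mem=2**29,  # ~512 MB; illustrative, not a tested value
)
```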
Thanks!
I think this is probably an issue with the DirectRunner, which stores all data in memory rather than writing to disk.
Thanks @shoyer! I tried with a PySpark runner and the OOM issues disappeared.
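For anyone who hits this, a minimal sketch of the kind of switch that worked for me, assuming Beam's portable SparkRunner (which needs Java available so Beam can auto-start its local job server); the options here are illustrative rather than exactly what I ran:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: swap the DirectRunner for Beam's portable Spark runner, which
# spills shuffle data to disk instead of holding everything in memory.
options = PipelineOptions([
    "--runner=SparkRunner",
    "--environment_type=LOOPBACK",  # run Python SDK workers in-process
])
with beam.Pipeline(options=options) as root:
    ...  # same DatasetToChunks | Rechunk | ChunksToZarr pipeline as above
```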