
memray-array

Measuring memory usage of Zarr array storage operations using memray.

In an ideal world, array storage operations would be zero-copy, but many libraries do not achieve this in practice. The scripts here measure the actual behaviour across different filesystems (local and S3), compression settings, and Zarr versions.

Summary

The workload is simple: create a random 100MB NumPy array and write it to Zarr storage in a single chunk. Then (in a separate process) read it back from storage into a new NumPy array.
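
For concreteness, here is a minimal sketch of the workload, assuming Zarr's default local store. It is not the actual memray-array.py script, which adds the compression, library, and store-prefix options used below.

# Minimal sketch of the workload (not the actual memray-array.py script).
import numpy as np
import zarr

n = 100 * 1024 * 1024 // 8  # 100MB of float64
arr = np.random.rand(n)

# Write the whole array as a single chunk.
z = zarr.open("data.zarr", mode="w", shape=arr.shape, chunks=arr.shape, dtype=arr.dtype)
z[:] = arr

# Read it back into a new NumPy array. In the benchmark this runs in a
# separate process so the write and read measurements don't interfere.
out = zarr.open("data.zarr", mode="r")[:]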

  • Writes with no compression incur a single buffer copy, except for Zarr v2 writing to the local filesystem. (This at least shows that zero-copy writes are possible.)
  • Writes with compression incur a second buffer copy, because implementations first write the compressed bytes into a separate buffer, which must be roughly the size of the uncompressed data, since it is not known in advance how compressible the data is. (See the sketch after this list.)
  • Reads with no compression incur a single copy from local files, but two copies from S3. This seems to be because the S3 libraries read many small blocks and then join them into a larger one, whereas a local file can be read in one go into a single buffer.
  • Reads with compression incur two buffer copies, except for Zarr v2 reading from the local filesystem.

It would seem there is scope to reduce the number of copies in some of these cases.
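
As a concrete illustration of the compression copy, here is a sketch using numcodecs, one of the codec libraries Zarr uses: encoding returns a freshly allocated buffer rather than writing into existing memory.

# Illustration using numcodecs (a codec library used by Zarr): encode()
# returns a newly allocated buffer holding the compressed bytes, so a
# compressed write pays at least one extra copy.
import numpy as np
from numcodecs import Zstd

arr = np.random.rand(100 * 1024 * 1024 // 8)  # 100MB of float64
compressed = Zstd().encode(arr)  # new buffer allocated here
print(len(compressed), "compressed bytes for", arr.nbytes, "uncompressed")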

Writes

Number of extra copies needed to write an array to storage using Zarr. (Links are to memray flamegraphs.)

Filesystem  Library  Zarr version  Uncompressed  Compressed
Local       fsspec   v2            0             2
Local       fsspec   v3            1             2
Local       obstore  v3            1             2
S3          fsspec   v2            1             2
S3          fsspec   v3            1             2

Reads

Number of extra copies needed to read an array from storage using Zarr. (Links are to memray flamegraphs.)

Filesystem  Library  Zarr version  Uncompressed  Compressed
Local       fsspec   v2            1             1
Local       fsspec   v3            1             2
Local       obstore  v3            1             2
S3          fsspec   v2            2             2
S3          fsspec   v3            2             2

How to run

Create a new virtual env (for Python 3.11), then run

pip install -r requirements.txt

Local

pip install -U 'zarr<3'
python memray-array.py write
python memray-array.py write --no-compress
python memray-array.py read
python memray-array.py read --no-compress

pip install -U 'zarr>3'
python memray-array.py write
python memray-array.py write --no-compress
python memray-array.py read
python memray-array.py read --no-compress

pip install -U 'git+https://github.com/kylebarron/zarr-python.git@kyle/object-store#egg=zarr'
python memray-array.py write --library obstore
python memray-array.py write --no-compress --library obstore
python memray-array.py read --library obstore
python memray-array.py read --no-compress --library obstore

S3

These can take a while to run (unless run from within AWS).

Note: change the URL to an S3 bucket you own and have already created.

pip install -U 'zarr<3'
python memray-array.py write --store-prefix=s3://cubed-unittest/mem-array
python memray-array.py write --no-compress --store-prefix=s3://cubed-unittest/mem-array
python memray-array.py read --store-prefix=s3://cubed-unittest/mem-array
python memray-array.py read --no-compress --store-prefix=s3://cubed-unittest/mem-array

pip install -U 'zarr>3'
python memray-array.py write --store-prefix=s3://cubed-unittest/mem-array
python memray-array.py write --no-compress --store-prefix=s3://cubed-unittest/mem-array
python memray-array.py read --store-prefix=s3://cubed-unittest/mem-array
python memray-array.py read --no-compress --store-prefix=s3://cubed-unittest/mem-array

Memray flamegraphs

mkdir -p flamegraphs
(cd profiles; for f in *.bin; do echo "$f"; python -m memray flamegraph --temporal -f -o "../flamegraphs/$f.html" "$f"; done)
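
The profiles/*.bin files consumed above are memray capture files written while the script runs. A sketch of how such a capture can be produced with memray's Tracker API follows; the actual script may wire this up differently.

# Sketch: recording a memray capture around an operation. The function
# name is hypothetical; memray-array.py may invoke memray differently.
import memray

with memray.Tracker("profiles/example.bin"):
    run_storage_operation()  # hypothetical: the write or read under test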