undark-lab/swyft

zarr storage is slow

bkmi opened this issue · 9 comments

bkmi commented

EDITED:
DirectoryStore is slow, presumably because it saves to disk. However, MemoryStore is also really slow with some parameter settings, namely when it backs a dataset that is loaded without any parallelism (few or no worker threads, i.e. num_workers=0 or num_workers=1).

We need to figure out why and how the user can avoid this behavior.

Ask me for proof of this if you need it.

fnattino commented

Hi @bkmi, by any chance have you tried increasing the chunk size (the chunksize argument when initialising the store)? The default value of 1 might not be optimal (I think the suggested values are such that each chunk is ~100 MB).

Performance depends on the details. For very simple data, and very simple networks, the stores are the bottleneck. If you have a way to improve that, I would be very happy to hear more. For reasonably complex data (images, with CNN embedding network, for example), GPU loads are high (between 50% and 100%), which seems to indicate that in that case training is the bottleneck.

Regarding the comment from fnattino: chunksize is indeed important, but please keep it at 1. We use random access, and any larger chunksize degrades performance.

If we give up full random access and train partially in the order of storage, we might be able to leverage larger chunk sizes: one could set the chunk size equal to the minibatch size, and then read each chunk in twice per training step (once for the data, and once for the contrastive parameters). That might lead to a significant speed-up of the DirectoryStore. The tricky thing is that we have to decide the chunk size when we set up the store, which reduces flexibility in the minibatch size during training.

bkmi commented

This is a big problem. It is absurd how long it takes to load data using this storage method. It is an issue with the MemoryStore as well.

A simple example: the store has 100,000 items in it, each a vector of length 4096, and only one process is used to load the data.

It takes 10 s per loop. This is something that should be imperceptibly fast. I can make another example if you need it, but I recommend just filling a store with random values and trying to calculate the mean of a subset of them.

%%timeit

dataset_ = swyft.Dataset(
    1024, 
    prior, 
    store, 
    simhook=inject_polarizations,
)

loader = torch.utils.data.DataLoader(
    dataset_,
    batch_size=512,
    num_workers=0,
    shuffle=False,
)

mean = 0.
meansq = 0.
for data in loader:
    xxx = data[0]["time_domain_strain"]
    # accumulate running moments over the batches instead of overwriting them
    mean = mean + xxx.mean(dim=0) / len(loader)
    meansq = meansq + (xxx**2).mean(dim=0) / len(loader)

std = torch.sqrt(meansq - mean**2)
print("mean: " + str(mean))
print("std: " + str(std))
bkmi commented

For comparison, you can generate a matrix and do the same computation in NumPy in 24 ms.

%%timeit
a = np.random.rand(1024, 4096)
a.mean(axis=0)
bkmi commented

Maybe we need an object that lives between store and dataset? I think it's likely that accessing zarr with Dataset is just too slow.

fnattino commented

I had a first look at the issue and created this script based on the notebook that @bkmi shared with us. In addition to the SWYFT dataset and the Torch-based dataset, I have added similar datasets for reference, based on Numpy arrays and Zarr arrays (with a MemoryStore). The latter should represent the "target" performance for the SWYFT dataset, which is also based on a Zarr MemoryStore.

It gives me the following timings (in seconds):

  • Torch dataset: 0.670860767364502
  • Swyft dataset: 14.468579769134521
  • Zarr dataset: 5.982049942016602
  • Numpy dataset: 0.991995096206665

So clearly the Numpy- and Torch-backed datasets are much faster. But there is also quite a gap between the "plain" Zarr dataset and the SWYFT one. I could narrow this gap down to the following points:

  • The simulations are saved in the SWYFT store via a Zarr group, which contains all the arrays with the simulation output:
    self.sims = self._root[self._filesystem.sims]

    Extracting an array from the group seems to be quite time consuming, and in the current version of SWYFT it takes place every time an element is accessed:
    x = {k: self._store.sims[k][i] for k in self._simkeys}

    Instead, one could store the simulation arrays in a dictionary, so that accessing them becomes faster:
    self.sims = {k: v for k, v in self._root[self._filesystem.sims].items()}
  • We can turn off compression of the chunks (set compressor=False when creating the Zarr arrays). This does not much affect datasets with small chunk sizes, but it does matter if larger chunks are used. With compressor set to False, I obtain more or less the same performance whether the chunk size along the sample dimension is set to 1 or to any other value.

I have tried to implement these changes, and I obtain a timing of 8.947084903717041 s for the SWYFT dataset. The remaining difference from the "plain" Zarr dataset should be due to the hypercube mapping that takes place when accessing data from the SWYFT dataset. If you agree, I could open a PR with these changes.

The other aspect is that a map-style Torch dataset, like the one implemented in SWYFT, might not be the ideal solution to work in combination with the Zarr stores. An iterable-style dataset might actually be faster (it seems to be suggested for cases where random reads are expensive): https://pytorch.org/docs/stable/data.html#iterable-style-datasets

bkmi commented

This is an important issue, but it will have to be dealt with in a future release. I appreciate all the work you've done on it already, @fnattino and everyone.

The v0.4 ZarrStore is more minimal and very fast. The v0.3 store might eventually be moved to future versions. For now, I am closing this issue.