scverse/anndata

Indicate in on-disk HDF5 specification if indices of sparse matrices are sorted

arteymix opened this issue · 2 comments

Please describe your wishes and possible alternatives to achieve the desired result.

In the HDF5 on-disk format spec, no mention is made that a CSR/CSC matrix has to have sorted indices. We thus have to make the assumption that it is unsorted in our workflow and sort systematically or reject unsorted input.

Having it in the spec would save us the $O(n)$ time it takes to check the matrix and allow us to directly slice in $O(\log n)$.

Including has_sorted_indices as an attribute would be helpful.

I think it can already be stated that indices cannot be duplicated and thus has_canonical_format would be redundant.

Currently I believe that we don't do any checking for sortedness of indices when writing, so they may get written unsorted. We could add an attribute to the storage that notes this, or could insist on this in a future spec update (though that would be breaking).

indices cannot be duplicated

I don't believe this is the case.

import numpy as np, zarr
from scipy import sparse
from anndata.experimental import write_elem

z = zarr.group()
a = sparse.csr_matrix((np.ones(5), [2, 2, 0, 0, 1], [0, 3, 5]), shape=(2, 3))
assert not a.has_canonical_format
write_elem(z, "a", a)
display(z["a/indices"][...])
# array([2, 2, 0, 0, 1], dtype=int32)

One way to handle this would be to make a minor version bump and ensure that indices are always sorted before being written to disk and document that in the spec.

Then all I'd need to do is to check the existing encoding version attribute.

I don't think it makes much sense to allow unsorted indices to be written to disk.