scverse/napari-spatialdata

Enhancement idea: Large multiscale / pyramidal image visualisation

rtubelleza opened this issue · 9 comments

Hello, great package - keen on integrating spatialdata and this napari interface potentially with some tools I'm developing.


Current Limitation
I notice there is a potentially missed opportunity for improving the visualisation of large multiscale images. Currently I have a SpatialData object with a MultiscaleSpatialImage spanning scale0 to scale4 (2x downsampling between consecutive scales). Using Interactive(sdata), this image is quite slow to render.

In napari_spatialdata/utils/_utils.py, _adjust_channels_order returns only scale0 (the full-resolution image) for MultiscaleSpatialImage instances. In napari, viewer.add_image accepts a list of arrays, so all scales could be passed, allowing the viewer to render the appropriate resolution for a given zoom level rather than having to render the entire full-resolution array.
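For illustration, here is a minimal sketch of the kind of list napari accepts (array sizes and the slicing-based downsampling are made up for the example; real pipelines use proper area/mean pooling):

```python
import numpy as np

# Hypothetical 5-level pyramid: scale0 at full resolution, 2x downsampling
# between levels (naive slicing here, just to build the list of arrays).
scale0 = np.random.randint(0, 2**16, size=(4000, 4000), dtype=np.uint16)
pyramid = [scale0[:: 2**i, :: 2**i] for i in range(5)]

print([a.shape for a in pyramid])
# napari can consume this list directly:
#   viewer.add_image(pyramid, multiscale=True)
# and picks the pyramid level that matches the current zoom.
```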


Potential Solution/Enhancement
This solution with zarr stores and LRU caching has worked well for me in a separate napari plugin I've made: https://forum.image.sc/t/what-is-the-proper-way-to-display-very-large-images-in-napari/77276/4. I'm not sure how it would fit with the current design, though, since 1) _adjust_channels_order is currently expected to return a single DataArray, and 2) it is unclear whether there should be a way to cache the DataArray/xarray images contained within the SpatialData object (i.e. by creating a zarr store for them).

Hey @rtubelleza, good spot :) I was aware of this, but haven't yet had the time to adjust it and to benchmark the asynchronous rendering that is now in napari. Adjusting _adjust_channels_order for multiscale wouldn't be a problem. Regarding point 2, I am also not certain this would be the way to go, particularly when dealing with highly multiplexed imaging: creating another zarr store would mean a lot of data duplication (unless I misunderstand you).

One thing to note, we do return a list of arrays:

    if isinstance(new_raster, (MultiscaleSpatialImage, DataTree)):
        list_of_xdata = []
        for k in new_raster:
            v = new_raster[k].values()
            assert len(v) == 1
            xdata = v.__iter__().__next__()
            list_of_xdata.append(xdata)
        new_raster = list_of_xdata
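For reference, the quoted loop can be written a bit more idiomatically with the same behavior; the sketch below uses plain dicts as a stand-in for the DataTree structure (each scale group holding exactly one array), with placeholder string values:

```python
# Stand-in for the DataTree: each scale group maps one name to one object
# ("xdata0" etc. are placeholders for the per-scale DataArray objects).
new_raster = {
    "scale0": {"image": "xdata0"},
    "scale1": {"image": "xdata1"},
}

# Equivalent to the quoted loop: next(iter(...)) replaces
# v.__iter__().__next__() and the explicit accumulator list.
list_of_xdata = [next(iter(new_raster[k].values())) for k in new_raster]
print(list_of_xdata)
```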

I will do some benchmarking

One question: are you trying remote reading (is that the reason why you implemented the LRUStoreCache)?

Ok I created some artificial data

import numpy as np
from spatialdata.models import Image2DModel

big_data = np.random.randint(100, 55555, size=(1, 150000, 150000), dtype=np.uint16)
big_image_model = Image2DModel.parse(
    big_data, dims=("c", "y", "x"), scale_factors=[2, 2, 2, 2], chunks=(1, 1000, 1000))
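For context, a back-of-envelope calculation shows why rendering the full-resolution level alone is so heavy:

```python
import numpy as np

# Footprint of the full-resolution level: 1 x 150000 x 150000 uint16.
shape = (1, 150_000, 150_000)
nbytes = int(np.prod(shape)) * np.dtype(np.uint16).itemsize
print(nbytes / 1e9)  # 45.0 (GB), far too large to render in one piece
```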

I add it to an artificial SpatialData object and then visualize:

[screen recording: 2024-05-21.16-18-38.mov]

One thing I also checked is whether I could enable async rendering, but with napari 0.4.19 this results in an error. With napari from main it works nicely, though. It would be good to know what kind of environment you are using for visualization.
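For reference, and assuming I remember the flag correctly (worth double-checking against the napari docs), the experimental async loading is toggled through an environment variable that must be set before napari is imported:

```python
import os

# NAPARI_ASYNC is my assumption for the async-rendering toggle; it must
# be set before `import napari` for napari to pick it up.
os.environ["NAPARI_ASYNC"] = "1"
print(os.environ["NAPARI_ASYNC"])
# import napari  # then create the viewer as usual
```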

Thanks for the replies,

One thing to note, we do return a list of arrays:

My bad, I completely missed the last part of that function, since I assumed the output from the return type hint.

One question, are you trying remote reading (as in this is the reason why you implement the LRUstorecache)?

No, mainly for visualisation performance: I found that zooming and panning are slightly more fluid with this. Currently, outside of napari-spatialdata, I load large TIFF files as zarr stores, add an LRUStoreCache layer, and pass the zarr.core.Array objects directly to the viewer. Example code:

import tifffile
import zarr
import napari

# Open the TIFF as a zarr store and wrap it in a 1 GiB LRU cache so
# recently viewed chunks are served from memory. The store must stay
# open while the viewer lazily reads from it.
store = tifffile.imread("../large_image.tif", aszarr=True)
cache = zarr.LRUStoreCache(store, max_size=2**30)
data = zarr.open(cache, mode="r")

viewer = napari.Viewer()
viewer.add_image(data, rgb=False, name="test")
viewer.dims.set_current_step(value=0, axis=0)
napari.run()
store.close()

I don't see significant memory usage issues with this approach.

When I parse my images into SpatialData and launch with napari_spatialdata, it's slower than the above (likely because the above reads from disk?). I've tried parsing the images as both numpy and dask arrays, and both are still slower than the above.

I've found this to be true for a smaller-scale version of your artificial data example (sized to fit in my machine's memory), even without the LRUStoreCache. Below is the code I ran, in case you want to run your own benchmarks as well:

import numpy as np
from spatialdata.models import Image2DModel
from spatialdata.transformations import Identity

np.random.seed(0)  # fixed seed for reproducible benchmarks
big_data = np.random.randint(100, 55555, size=(5, 8000, 8000), dtype=np.uint16)
big_image_model = Image2DModel.parse(
    big_data, dims=("c", "y", "x"), scale_factors=[2, 2, 2, 2], chunks=(1, 1000, 1000),
    transformations={"mirror": Identity()})  # rename / transformation example

Zarr/Viewer Example:

from spatialdata import SpatialData
import imageio
import tifffile
import zarr
import napari

# Data duplication ..
imageio.volwrite("big_data.tif", big_data)

# The store must stay open while the viewer lazily reads from it, so it
# is closed only after napari exits.
store = tifffile.imread("big_data.tif", aszarr=True)
# store = zarr.LRUStoreCache(store, max_size=2**30)  # with or without
zobj = zarr.open(store, mode="r")

viewer = napari.Viewer()
viewer.add_image(zobj, rgb=False, name="test")
viewer.dims.set_current_step(value=0, axis=0)
napari.run()
store.close()

Sdata Example:

from napari_spatialdata import Interactive

sdata = SpatialData(images={"test": big_image_model})
_ = Interactive(sdata)

I understand this might be beyond the scope or current priorities of the napari-spatialdata team; however, it would be great to see this in the future (or other alternative approaches), as it would form a solid foundation for other plugins and packages.

I would have to see how this would fit with the DataTree, which is ultimately what the multiscale images are. I will discuss it with the team. At least for remote reading this should also give some benefit, but we need proper benchmarks in place.

Thanks @rtubelleza for your interest in the package.

One thing to note, we do return a list of arrays:

    if isinstance(new_raster, (MultiscaleSpatialImage, DataTree)):
        list_of_xdata = []
        for k in new_raster:
            v = new_raster[k].values()
            assert len(v) == 1
            xdata = v.__iter__().__next__()
            list_of_xdata.append(xdata)
        new_raster = list_of_xdata

I will do some benchmarking
I confirm: the _adjust_channels_order() function already returns a list of DataArray objects; only the return type hint was incorrect. I fixed this in this small PR: #269.

We should do some benchmarks and compare the two approaches.

In my experience (i.e. a MacBook Pro from a few years ago with at least 16 GB of RAM), the user experience is good enough for images of any size when the number of channels is != 3. Not as fluid as Photoshop, so to say, but very usable for visualization and data annotation.

Exceptions may arise in the following scenarios, which we should consider for benchmarks:

  • RGB images: here we need to load 3 times the amount of data from disk, since in Zarr the various channels are stored in separate groups.
  • Data from network storage: here the IO speed is compromised.
  • Workstations with lower specifications than the ones we use for development.
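A minimal timing harness for such benchmarks could look like this (the workload is a placeholder; one would swap in reading a single chunk from local disk vs. network storage):

```python
import time

import numpy as np

def time_reads(read, n: int = 5) -> float:
    """Average wall-clock seconds over n repeated reads."""
    t0 = time.perf_counter()
    for _ in range(n):
        read()
    return (time.perf_counter() - t0) / n

# Placeholder workload: summing one chunk-sized array stands in for
# fetching a single (1000, 1000) chunk from storage.
chunk = np.zeros((1000, 1000), dtype=np.uint16)
avg = time_reads(lambda: chunk.sum())
print(f"{avg * 1e3:.4f} ms per read")
```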

Hello,

Thanks for the updates. I recently revisited this and realised the issue: I was working mainly with SpatialData objects that were not backed by an on-disk zarr store (since the plugins I'm working on operate on the image at various scales, and I was testing spatialdata as the main data model to compartmentalise images, shapes, tables, etc.).

When writing the SpatialData object, with just the image alone, to a zarr store, the performance was comparable. The caching seemed to achieve something similar, but directly on the raw image file.

The zarr writing is probably the main bottleneck; I opened a separate issue for it: scverse/spatialdata#577

This can be closed, thanks.