scverse/spatialdata-io

Improvements for Stereoseq reader

Closed this issue · 9 comments

Hi guys,

many thanks for the great package and especially Laurens, Tim and Maiia for enabling the Stereoseq support within spatialdata. I just had a call with Laurens and he suggested to document potential improvements for the stereoseq reader in this issue.

The following suggestions refer to Laurens' pull request #70.

  1. Since BGI released the bioinformatics V2 solution, there are some slight changes in the SAW output structure to support IF images. I would propose the following minor changes to make the reader compatible with the new output of the pipeline.
  • read the cellbin_gef file from StereoseqKeys.CELLCUT rather than StereoseqKeys.REGISTER
  • since there may be multiple mask files, one per image channel, I would recommend selecting the correct mask file for the 'Labels' by matching it with .tiff files in the StereoseqKeys.REGISTER folder that start with either 'DAPI' or 'ssDNA'
  2. The reader takes quite a bit of time (and RAM) because it loads heavy files that may not be of interest to the user. I would recommend dropping superfluous images and making the loading of some data optional via arguments.
  • images that end with *_fov_stitched_transformed.tif are not used in any downstream analysis task and are untransformed replicates of regist.tif. I would drop these.
  • I would also drop '*_tissue_cut.tif' files or make loading them optional
  • I think most users will read the data either as square bin or as cell bin, so I would add a 'bin' argument that loads the cell bin or the square bin (at a specified size)
  3. I am not sure whether this is a user error, but I think there may be a problem with reading the segmentation mask. E.g. when aggregating image intensities over the segmentation mask (according to your vignette https://spatialdata.scverse.org/en/latest/tutorials/notebooks/notebooks/examples/aggregation.html) I obtain a single .obs in the table slot. Is there a reason not to incorporate the cell boundaries (as specified in the .obsm) as shapes?

  4. Also not sure whether this is a user error or related to 3). Rendering shapes with color keys from the table slot results in grey images, and the color keys are not displayed correctly.
    E.g. sdata.pl.render_shapes(color="total_counts").pl.show() does not display the total counts, which is an .obs column in the table slot of sdata. Do you have any idea what may cause this issue?
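
To make the file selection proposed in 1) and 2) concrete, here is a minimal sketch of how the register folder could be filtered. It is not the reader's actual code: select_register_files and load_tissue_cut are hypothetical names, and the '_mask.tif' suffix for mask files is an assumption based on the file names mentioned in this thread.

```python
from pathlib import Path

# Stain prefixes named in this issue for the nuclear channel.
NUCLEAR_PREFIXES = ("DAPI", "ssDNA")


def select_register_files(register_dir: Path, load_tissue_cut: bool = False):
    """Split the TIFF files of a register folder into images and label masks.

    Hypothetical helper: the file-name patterns are assumptions taken from
    the discussion, not from the SAW documentation.
    """
    images, masks = [], []
    for f in sorted(register_dir.iterdir()):
        name = f.name
        if not name.endswith((".tif", ".tiff")):
            continue  # not an image file
        if name.endswith("_fov_stitched_transformed.tif"):
            continue  # untransformed replicate of regist.tif, never loaded
        if name.endswith("_tissue_cut.tif") and not load_tissue_cut:
            continue  # only loaded when the user asks for it
        if name.endswith("_mask.tif"):
            # Keep only the mask that belongs to the nuclear stain.
            if name.startswith(NUCLEAR_PREFIXES):
                masks.append(f)
            continue
        images.append(f)
    return images, masks
```

With such a helper, masks from other channels (for example IF) would be ignored for the 'Labels', and the heavy tissue-cut images would only be read on request.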

Many thanks for the great community work. Let me know if you would prefer a pull request for 1).

Cheers,
Flo

Thank you @florianingelfinger for the feedback, and indeed great work from @LLehner @timtreis and Maiia! 🎉 If you could open a pull request for point 1, it would be convenient for us, but in any case we will try to incorporate the feedback before merging the stereoseq PR to main.

Regarding points 3 and 4, could I kindly ask you to create a reproducible example with artificial data and open one issue in spatialdata for point 3 and one in spatialdata-plot for point 4?

I am happy to work on point 3 but currently I don't have stereoseq data on my machine, and it could be the same for point 4 for which @Sonja-Stockhaus could help.

Thank you,
Luca

Chiming in for #4: I just released a version to PyPI with a bunch of fixes. Could you try it and see if the issue persists?

I will begin shortly with adding 1. and 2. to the reader.

Points 1. and 2. have been added to the reader and there is a significant speed-up.

Try sdata = sio.stereoseq(path=result_path, read_square_bin=False)

How is it going with the rest?

Many thanks guys! I will test the reader update and also assess whether #3 and #4 have been fixed. I will get back to you by Monday!

@LLehner: quick feedback on the update. I am getting an error in line 108 of the stereoseq reader because the .gef file no longer exists in the register folder.

path_cellbin = path / StereoseqKeys.REGISTER / cellbin_gef_filename[0]
cellbin_gef = h5py.File(str(path_cellbin), "r")

I think line 107 should also use StereoseqKeys.CELLCUT instead of StereoseqKeys.REGISTER.
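
As a side note, a more defensive lookup would make this kind of mismatch fail with a clear message instead of inside h5py. The following is only a sketch under assumptions: find_cellbin_gef is a hypothetical helper, and the folder name and the '.cellbin.gef' suffix are passed in as parameters because the actual values behind StereoseqKeys are not shown in this thread.

```python
from pathlib import Path


def find_cellbin_gef(root: Path, cellcut_dir: str, suffix: str = ".cellbin.gef") -> Path:
    """Return the single cell-bin GEF file inside root / cellcut_dir.

    Hypothetical helper: `cellcut_dir` stands in for StereoseqKeys.CELLCUT and
    the default `suffix` is an assumed file ending.
    """
    folder = root / cellcut_dir
    matches = sorted(folder.glob(f"*{suffix}")) if folder.is_dir() else []
    if not matches:
        # Explicit error instead of an opaque failure when opening the file.
        raise FileNotFoundError(f"No '*{suffix}' file found in {folder}")
    if len(matches) > 1:
        raise ValueError(f"Expected one '*{suffix}' file in {folder}, found {len(matches)}")
    return matches[0]
```

The returned path could then be passed to h5py.File(str(path_cellbin), "r") as in the two lines quoted above.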

Oh, missed that line before. Should work now though.

Additionally, there seems to be a problem with the naming of the image, shapes and labels keys. The element names must be unique, but the image keys and the labels keys are the same (the file name of the *_mask.tif file). We could also have a quick call next week to look into it together, if that is more convenient for you.
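
One possible way to avoid the clash, sketched here under assumptions, is to rename a labels key whenever it collides with an image key. unique_label_names and the '_labels' suffix are hypothetical choices for illustration, not what the reader currently does.

```python
def unique_label_names(image_names, label_names, suffix="_labels"):
    """Map each labels key to a name that does not collide with any image key.

    Hypothetical helper: a colliding name gets `suffix` appended, and a
    counter is added in the unlikely case that the suffixed name is taken too.
    """
    taken = set(image_names)
    renamed = {}
    for name in label_names:
        candidate = name
        counter = 0
        while candidate in taken:
            counter += 1
            candidate = f"{name}{suffix}" if counter == 1 else f"{name}{suffix}_{counter}"
        taken.add(candidate)
        renamed[name] = candidate
    return renamed
```

For example, an image key and a labels key that are both derived from the same *_mask.tif file name would end up as two distinct element names, with the labels key carrying the suffix.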

Quick update: #1 and #2 have been fixed by @LLehner 👏 I think that #3 and #4 are specific to the stereoseq reader (the aggregation works for the blob dataset) and I am happy to provide some real data if it helps!