Visium: Failed to mask tissue
Hi, thank you for the great method! I tried to apply it to my Visium dataset, but I got the following warnings for all my samples at the conversion step:
[2021-12-29 13:40:49,020] ℹ : Running xfuse version 0.2.1
[2021-12-29 13:40:56,493] ℹ : Computing tissue mask:
[2021-12-29 13:40:56,500] ⚠ WARNING : UserWarning (/nfs/users/nfs_p/pm19/.local/lib/python3.9/site-packages/xfuse/utility/mask.py:67): Failed to mask tissue OpenCV(4.5.4) /tmp/pip-req-build-kv0l0wqx/opencv/modules/imgproc/src/grabcut.cpp:386: error: (-215:Assertion failed) !bgdSamples.empty() && !fgdSamples.empty() in function 'initGMMs'
[2021-12-29 13:41:07,029] ⚠ WARNING : UserWarning (/nfs/users/nfs_p/pm19/.local/lib/python3.9/site-packages/xfuse/convert/utility.py:217): Count matrix contains duplicated columns. Counts will be summed by column name.
[2021-12-29 13:41:09,749] ⚠ WARNING : FutureWarning (/nfs/users/nfs_p/pm19/.local/lib/python3.9/site-packages/xfuse/convert/utility.py:227): Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
I am mostly worried about the "Failed to mask tissue" warning. In this dataset we instructed spaceranger to consider all spots, because tissue autodetection failed to find the relatively transparent adipose tissue. We then manually annotated the tissue spots, and I introduced this information into the tissue-positions file (second column). As far as I can see, xfuse ignores this information and attempts to mask the tissue internally, but this procedure fails. Am I right that in this case xfuse considers all spots? At least it looks like this based on manual inspection of the data.h5 file and the high intensity of some metagenes in out-of-tissue regions. Can I force xfuse to use the tissue mask provided in the tissue-positions file?
Is "Count matrix contains duplicated columns" warning about gene names?
Also, when I run xfuse, at some point it reports "Registering experiment: ST (data type: "ST")" even though it is actually Visium data. Is this important, or can I just ignore it?
Hi,
Thanks for your feedback! You are right that xfuse currently ignores the second column in the spot_positions file and only uses the image to compute the tissue mask. Tissue masking does not always work, especially when the tissue is not clearly delineated from the background. It is definitely something that would be good to improve.
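For context, the error in the log comes from OpenCV's GrabCut (the initGMMs assertion fires when the initial seeding leaves either the foreground or the background sample set empty, e.g. when the whole image ends up assigned to a single class). A rough, generic sketch of mask-initialized GrabCut, not the exact code in xfuse, with a placeholder image path and seed region:

import cv2
import numpy as np

# Load the H&E image (path is a placeholder)
image = cv2.imread("he_image.png")

# Start from an all-"probably background" mask and seed a central region as
# "probably foreground"; real pipelines derive the seeds from the image itself
mask = np.full(image.shape[:2], cv2.GC_PR_BGD, dtype=np.uint8)
h, w = mask.shape
mask[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = cv2.GC_PR_FGD

bgd_model = np.zeros((1, 65), dtype=np.float64)
fgd_model = np.zeros((1, 65), dtype=np.float64)
cv2.grabCut(image, mask, None, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_MASK)

tissue_mask = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))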
I have created a new branch, improve-visium-masking, that attempts to make use of the tissue information in the spot_positions file. It would be interesting to hear if it works better for your tissue! You can install the new branch with
pip install git+https://github.com/ludvb/xfuse.git@improve-visium-masking
if you'd like to try it out. We could potentially also add a way for users to provide a custom image mask if this still doesn't work.
To visualize the mask, you can run something like:
import h5py
import matplotlib.pyplot as plt
with h5py.File("/path/to/data.h5", "r") as d:
    # the mask marks pixels whose label is not 1
    mask = d['label'][()] != 1
plt.imshow(mask)
plt.show()
Regarding the duplicated columns warning: Xfuse uses the HGNC IDs from the Space Ranger hdf5 file. There will be some distinct HGNC IDs that refer to multiple ENSEMBL IDs (typically corresponding to different splice variants). The counts for those ENSEMBL IDs are summed when computing the counts for each HGNC ID. This warning is expected for Space Ranger data.
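As a small illustration of what the summation does (toy data, not the actual Space Ranger matrix):

import pandas as pd

# Toy example: two columns share the same HGNC symbol ("TP53"),
# as happens when several ENSEMBL IDs map to one symbol
counts = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=["ACTB", "TP53", "TP53"],
)

# Sum the duplicated columns by name (equivalent in spirit to the
# df.groupby(level=1).sum() mentioned in the FutureWarning above)
summed = counts.T.groupby(level=0).sum().T
print(summed)
#    ACTB  TP53
# 0     1     5
# 1     4    11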
Regarding experiment type: I agree this log message is confusing; ST and Visium data are in fact modeled in the same way. The "ST" experiment type is currently the only one in use.
Thanks for reporting back! And great to see that the masking works better now. I would not worry too much about the fiducials, as they shouldn't have a large impact on learning, but imputation results may be off in those areas.
It should be possible to extract the prediction data by setting writer = "tensor" under the gene maps config section, e.g.:
[analyses.analysis-gene_maps]
type = "gene_maps"
[analyses.analysis-gene_maps.options]
gene_regex = ".*"
writer = "tensor"
The results are saved as pickled torch.Tensors and can be loaded using torch.load. Something to be mindful of is that the output files tend to be very large, as they store all Monte Carlo samples, so it may be a good idea to limit the analysis to specific genes using the gene_regex option.
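To give a sense of how the saved tensors can be read back (the file path and tensor shape below are assumptions; the actual layout depends on your save path and analysis name):

import torch

# Hypothetical output path; the real location depends on the save path
# and the analysis name configured above
samples = torch.load("analyses/final/analysis-gene_maps/GENE.pt", map_location="cpu")

print(samples.shape)                 # assumed: (num_samples, height, width)
mean_map = samples.float().mean(0)   # average over the Monte Carlo samples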
Hi, I'm hitting the same error as described in the first post above. Incidentally, I hit the error when using the improve-visium-masking branch instead of master.
When running master, most samples work fine except for ~15% of them, where the masks come out inverted. This usually happens on slides that are almost fully covered by tissue (and so have a less clear background).
To fix that, I tried the improve-visium-masking branch, which, from the discussion above, should use the tissue-position-list file. However, I don't find that happening: both tissue and background are being picked up, so masking is effectively not happening.
Am I missing something here? Do I need to set certain config options to make this work smoothly?
Hi,
Thanks for the report and all the debugging effort so far! :) It seems the current masking procedure has several failure modes. A lot of tweaking would probably be required to make it fully robust, but I think we should at least provide a means for users to specify a custom mask. The custom mask could then be annotated manually or created with more specialized tools.
I've updated the improve-visium-masking branch. There is now a new command line flag --mask-file, which can be used to specify a single-channel image file of the same size as the --image, with the annotations 0=background, 1=foreground, 2=likely background, 3=likely foreground. If you have time to try it out, any feedback would be super helpful!
Hi,
Thanks for the new option. I won't have time to try it out at the moment, sorry about that. When reviewing the masking (for samples where it fails), I notice only a small number of pixels with the definite foreground value (1=foreground), most of them being GC_PR_FGD instead. Perhaps this approach might help: I could create a mask file using the histomicstk package, which has worked for me in the past.
A simple alternative in my mind is to create a slightly crude mask using the tissue_positions_list.csv file in the spaceranger output: each circular spot could be stretched into a rectangle of an appropriate size to cover all of its pixels (or using a better approach if you're already using one). I hope to find time for this later. Thanks!
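Roughly what I have in mind (an untested sketch; the image size and rectangle half-width are placeholders, and the CSV columns assume the header-less Space Ranger layout):

import numpy as np
import pandas as pd

# Header-less tissue_positions_list.csv from Space Ranger:
# barcode, in_tissue, array_row, array_col, pxl_row_in_fullres, pxl_col_in_fullres
positions = pd.read_csv(
    "tissue_positions_list.csv",
    header=None,
    names=["barcode", "in_tissue", "array_row", "array_col", "pxl_row", "pxl_col"],
)

height, width = 20000, 20000   # full-resolution H&E image size (placeholder)
half = 120                     # half the rectangle size in pixels (placeholder)
mask = np.zeros((height, width), dtype=bool)

# Paint a rectangle around every spot annotated as in-tissue
for _, spot in positions[positions.in_tissue == 1].iterrows():
    r, c = int(spot.pxl_row), int(spot.pxl_col)
    mask[max(r - half, 0) : r + half, max(c - half, 0) : c + half] = True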
The way the masking should work right now is that spots will be assigned as GC_FGD or GC_BGD based on the annotation in the tissue_positions_list.csv file, while spots outside the tissue will be assigned as GC_PR_FGD or GC_PR_BGD based on the closest spot:
(see lines 105 to 109 in 042e9a9)
There are probably better ways to do this - if you figure something out, any contribution would be much appreciated!
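Roughly, the assignment works along these lines (just an illustration, not the code linked above; the spot coordinates, image size, and spot radius are placeholders):

import numpy as np
from scipy.spatial import cKDTree

GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD = 0, 1, 2, 3

spot_yx = np.array([[100, 120], [400, 380], [700, 650]])  # spot centers (placeholders)
in_tissue = np.array([1, 1, 0])                           # annotations from tissue_positions_list.csv

height, width = 800, 800
ys, xs = np.mgrid[0:height, 0:width]
pixels = np.column_stack([ys.ravel(), xs.ravel()])

# Every pixel looks up its nearest spot
dist, idx = cKDTree(spot_yx).query(pixels)
nearest_in_tissue = in_tissue[idx]

# Probable labels everywhere, definite labels where a spot actually covers the pixel
labels = np.where(nearest_in_tissue == 1, GC_PR_FGD, GC_PR_BGD).astype(np.uint8)
covered = dist <= 50   # assumed spot radius in pixels
labels[covered] = np.where(nearest_in_tissue[covered] == 1, GC_FGD, GC_BGD)
labels = labels.reshape(height, width)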
One thing to keep in mind with this way of initializing the mask is that it's best to use the raw_feature_bc_matrix from Space Ranger. The filtered matrix does not contain data from spots outside the tissue, so those spots will get filtered out before the masking step here:
(see line 70 in 042e9a9)
This means everything will be assigned as GC_FGD or GC_PR_FGD when using the filtered matrix. I'm not sure if this may be the cause of the issues you are experiencing, but we should probably add a note about this in the README or postpone filtering the tissue_positions_list until after the mask has been created.
Thanks for the explanation. That might explain one of the two failure scenarios that I'm seeing. It might be worth looking at the raw_feature_bc_matrix file instead of the filtered one, as I'm seeing a lot more GC_PR_FGD than I should be.
So far this has impacted <20% of my test samples, so I'm still able to evaluate a lot of them with the current code. Eventually I'll get back to those 20%, and using the raw data matrix will be the first thing I try. I will definitely keep you posted!
Yep, could be the case. Thanks for your help ironing out all the issues so far. Do keep me posted on how it goes! :)