Question about PairedSCGLUEModel
alitinet opened this issue · 6 comments
Hi,
does `PairedSCGLUEModel` work by finding the common `.obs_names` among all present modalities? Or is it also possible to use the paired model when, e.g., integrating 3 modalities where only two of the three share cells? The setup would be the following: we are trying to integrate a CITE-seq dataset (same cells for the RNA and ADT modalities) and a CyTOF dataset (whose cells are different from the CITE-seq dataset). Would the model be able to pair RNA and ADT cells? Thanks!
Hi @alitinet! Thanks for your interest in GLUE! The short answer is yes. The `PairedSCGLUEModel` works when only two out of three modalities share cells.
It doesn't just extract the common `.obs_names`. Instead, it takes the unique values of `.obs_names` across all modalities and pairs cells with the same `.obs_names`, no matter how many modalities each cell covers, through a matrix that we call the "pairing mask" (`pmsk` in the code). E.g., say we have an RNA modality with cells [A, B, C], an ADT modality with cells [B, C, D] and an ATAC modality with cells [C, D, E]. The `pmsk` looks like this:
| | RNA | ADT | ATAC |
|---|---|---|---|
| A | 1 | 0 | 0 |
| B | 1 | 1 | 0 |
| C | 1 | 1 | 1 |
| D | 0 | 1 | 1 |
| E | 0 | 0 | 1 |
The pairing loss is computed based on this `pmsk`, so it can accommodate any pairing pattern, including the setting you mentioned.
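To make the idea concrete, here is a minimal sketch of how such a mask could be built from the cell names of each modality. It only illustrates the concept and is not the actual GLUE implementation; the function name `build_pairing_mask` and the `obs_names` dictionary are made up for this example.

```python
import numpy as np

# Toy input (hypothetical): modality name -> cell names, as in the example above
obs_names = {
    "RNA": ["A", "B", "C"],
    "ADT": ["B", "C", "D"],
    "ATAC": ["C", "D", "E"],
}

def build_pairing_mask(obs_names):
    """Return (cells, modalities, mask), where mask[i, j] is 1 if
    cell i is observed in modality j and 0 otherwise."""
    # Union of all cell names across modalities, in sorted order
    cells = sorted(set().union(*obs_names.values()))
    modalities = list(obs_names)
    mask = np.zeros((len(cells), len(modalities)), dtype=int)
    for j, modality in enumerate(modalities):
        present = set(obs_names[modality])
        for i, cell in enumerate(cells):
            mask[i, j] = int(cell in present)
    return cells, modalities, mask

cells, modalities, pmsk = build_pairing_mask(obs_names)
print(modalities)
print(pmsk)
```

Each row with more than one nonzero entry corresponds to a cell that is paired across modalities; rows with a single nonzero entry are unpaired cells.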
Let me know if there are any further problems!
Hi @Jeff1995,
Thanks so much for the quick reply! This is great. A follow-up question: when using `PairedSCGLUEModel`, the model still outputs an embedding per cell per modality, right? So the pairing is only used to calculate the pairing loss? Or is there a way to obtain only one embedding per cell, i.e. in your example above to obtain 5 embeddings (1 per cell), and not 3+3+3=9 embeddings?
I'm afraid that's not currently supported. The model always returns all 9 embeddings. In this case I'd suggest taking the mean of paired cell embeddings.
I'll see if I can add an additional function to compute this, but for now you would have to compute this mean manually.
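In the meantime, a manual version could look roughly like the sketch below. It assumes you have already collected the embeddings of each modality into a pandas DataFrame indexed by `.obs_names`; the helper name `average_paired_embeddings` and the toy values are made up for illustration, and how you pull the embeddings out of the model is not shown here.

```python
import numpy as np
import pandas as pd

def average_paired_embeddings(embeddings):
    """embeddings: dict of modality name -> DataFrame of embeddings,
    indexed by cell name. Returns one row per unique cell, averaged
    over all modalities in which that cell appears."""
    stacked = pd.concat(list(embeddings.values()))
    # Rows sharing the same index label (cell name) are averaged together
    return stacked.groupby(level=0).mean()

# Toy 2-dimensional embeddings (hypothetical values) for the example above
embeddings = {
    "RNA": pd.DataFrame(np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]), index=["A", "B", "C"]),
    "ADT": pd.DataFrame(np.array([[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]]), index=["B", "C", "D"]),
    "ATAC": pd.DataFrame(np.array([[7.0, 7.0], [8.0, 8.0], [9.0, 9.0]]), index=["C", "D", "E"]),
}

merged = average_paired_embeddings(embeddings)
print(merged)  # 5 rows, one per unique cell
```

Unpaired cells (A and E here) simply keep their single embedding, since the mean over one row is that row.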
Got it, thanks for your prompt replies!
Great! I'll let you know when that function becomes available :)
Hi, I am also interested in integrating CITE-seq data (paired, like a 10X Multiome dataset). My idea is to modify the guidance graph based on gene-protein encoding relations and self-loops. Moreover, I model the protein data with a normal distribution rather than NB. I have implemented one version, and I wonder if I can open a pull request to upload my code. Thanks.