gao-lab/GLUE

Question about PairedSCGLUEModel

alitinet opened this issue · 6 comments

Hi,

does PairedSCGLUEModel work by finding the .obs_names common to all modalities present? Or is it also possible to use the paired model when, e.g., integrating 3 modalities where only two of the three share cells? Our setup is the following: we are trying to integrate a CITE-seq dataset (same cells for the RNA and ADT modalities) and a CyTOF dataset (whose cells are different from those in the CITE-seq dataset). Would the model be able to pair RNA and ADT cells? Thanks!

Hi @alitinet! Thanks for your interest in GLUE! The short answer is yes. The PairedSCGLUEModel works when only two of the three modalities share cells.

It doesn't just extract the common .obs_names. Instead, it takes the union of unique .obs_names across all modalities and pairs cells that share the same .obs_names, no matter how many modalities each cell covers, through a matrix that we call the "pairing mask" (pmsk in the code). E.g., say we have an RNA modality with cells [A, B, C], an ADT modality with cells [B, C, D] and an ATAC modality with cells [C, D, E]. The pmsk looks like this (rows are cells, columns are modalities, 1 means the cell is present in that modality):

Cell  RNA  ADT  ATAC
A     1    0    0
B     1    1    0
C     1    1    1
D     0    1    1
E     0    0    1

The pairing loss is computed from this pmsk, so it can accommodate any pairing pattern, including the setting you mentioned.
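For illustration, the pairing mask in the example above could be built roughly as follows. This is a minimal, dependency-free sketch and not GLUE's actual implementation: the function name `build_pairing_mask` is hypothetical, and plain lists stand in for each modality's .obs_names.

```python
def build_pairing_mask(obs_names_by_modality):
    """Return (cells, modalities, mask); mask[i][j] is 1 if cell i occurs in modality j.

    Hypothetical helper for illustration only, not part of the GLUE API.
    """
    modalities = list(obs_names_by_modality)
    # Union of cell names across all modalities, sorted for a stable row order
    cells = sorted(set().union(*obs_names_by_modality.values()))
    # Sets make the membership test below cheap
    members = {m: set(names) for m, names in obs_names_by_modality.items()}
    mask = [[int(cell in members[m]) for m in modalities] for cell in cells]
    return cells, modalities, mask


cells, modalities, pmsk = build_pairing_mask({
    "RNA": ["A", "B", "C"],
    "ADT": ["B", "C", "D"],
    "ATAC": ["C", "D", "E"],
})
for cell, row in zip(cells, pmsk):
    print(cell, *row)
```

Each row with more than one 1 marks a cell that is paired across the corresponding modalities.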

Let me know if there are any further problems!

Hi @Jeff1995,

Thanks so much for the quick reply! This is great. A follow-up question: when using PairedSCGLUEModel, the model still outputs one embedding per cell per modality, right? So the pairing is only used to calculate the pairing loss? Or is there a way to obtain a single embedding per cell, i.e. in your example above, 5 embeddings (1 per cell) rather than 3+3+3=9 embeddings?

I'm afraid that's not currently supported. The model always returns all 9 embeddings. In this case I'd suggest taking the mean of the embeddings of paired cells, i.e. averaging the embeddings that share the same .obs_names across modalities.

I'll see if I can add an additional function to compute this, but for now you would have to compute this mean manually.
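Until such a function exists, the manual mean could look something like the sketch below. It assumes the per-modality embeddings have already been computed and are given in the same order as each modality's .obs_names; the function name `mean_paired_embeddings` and the toy 2-dimensional vectors are hypothetical, and plain lists are used in place of arrays to keep the example dependency-free.

```python
from collections import defaultdict


def mean_paired_embeddings(obs_names_by_modality, embeddings_by_modality):
    """Average embeddings that share a cell name across modalities.

    Hypothetical helper for illustration only, not part of the GLUE API.
    Returns a dict mapping each cell name to one averaged embedding.
    """
    grouped = defaultdict(list)
    for modality, names in obs_names_by_modality.items():
        # Embeddings are assumed to be row-aligned with the cell names
        for name, vec in zip(names, embeddings_by_modality[modality]):
            grouped[name].append(vec)
    # Element-wise mean over however many modalities cover each cell
    return {
        name: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for name, vecs in grouped.items()
    }


obs_names = {"RNA": ["A", "B", "C"], "ADT": ["B", "C", "D"], "ATAC": ["C", "D", "E"]}
embeddings = {
    "RNA": [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    "ADT": [[3.0, 3.0], [4.0, 4.0], [5.0, 5.0]],
    "ATAC": [[6.0, 6.0], [7.0, 7.0], [8.0, 8.0]],
}
merged = mean_paired_embeddings(obs_names, embeddings)
print(len(merged))  # one embedding per unique cell name
```

Cells covered by a single modality simply keep that modality's embedding, so the 9 per-modality embeddings collapse to 5.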

Got it, thanks for your prompt replies!

Great! I'll let you know when that function becomes available :)

Hi, I am also interested in integrating CITE-seq data (paired, like a 10x Multiome dataset). My idea is to modify the guidance graph based on gene-protein encoding relations and self-loops. Moreover, I model the protein data with a normal distribution rather than NB (negative binomial). I have implemented one version, and I wonder if I can open a pull request to contribute my code. Thanks.