snap-stanford/UCE

Genes without protein LLM embeddings


Hello @Yanay1 @yhr91 !

I was wondering how UCE handles genes that do not have a corresponding protein embedding (e.g., non-coding genes, or genes whose symbols don't match the provided protein embedding reference IDs). Does it simply drop these genes during preprocessing of the AnnData object?

In normal circumstances, dropping these genes may not be much of an issue, since plenty of other expressed genes remain for inference. But I have a somewhat unusual case: some of my "cells" (they're actually vectors of gene associations for disease traits) have only 1 gene. So if that 1 gene gets dropped, the whole trait gets dropped. I'm also curious whether you have any other concerns about trying to embed 1-gene "cells" with UCE to begin with.

Is there a way to retain these genes without introducing too much bias (genes with protein embeddings vs. those without)? Or is this something that's simply too deeply ingrained in UCE?

Thanks so much!
Brian

Hi Brian,

Unfortunately, we are currently limited to protein-coding genes that have a reference AA sequence, so that we can use a protein embedding for them (as in SATURN).

Even so, I don't think the model will be able to give a meaningful embedding of a "cell" with just one gene in it. That gene would be repeated 1024 times in the sample, and the end result may be very weird.
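
To make the repetition concrete, here is a minimal sketch of what sampling a fixed-length gene "sentence" from a one-gene cell looks like. The sample size of 1024 comes from the reply above. The function name `sample_gene_indices` and the log-expression weighting are assumptions for illustration only, not UCE's actual sampling code.

```python
import numpy as np

SAMPLE_SIZE = 1024  # figure quoted in the reply above


def sample_gene_indices(expression, sample_size=SAMPLE_SIZE, seed=0):
    """Sample gene indices with replacement, weighted by log1p(expression).

    Hypothetical helper: the weighting scheme is an assumption, not UCE's code.
    """
    expression = np.asarray(expression, dtype=float)
    weights = np.log1p(expression)
    total = weights.sum()
    if total == 0:
        raise ValueError("cell has no expressed genes to sample from")
    rng = np.random.default_rng(seed)
    return rng.choice(len(expression), size=sample_size, replace=True, p=weights / total)


# A "cell" in which only the gene at index 2 has a nonzero value.
one_gene_cell = np.array([0.0, 0.0, 5.0, 0.0])
sampled = sample_gene_indices(one_gene_cell)
print(np.unique(sampled))  # only index 2 can ever be drawn
```

Under this kind of scheme, every position in the sample is the same gene, so the model sees no variation within the input.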

Makes sense, thanks! It seems like UCE may be dropping phenotypes with few genes anyway. I'm inferring this from the fact that my input AnnData object has the shape 22773 × 18826, and the output object has the shape 11963 × 18160 (even when I set --filter False).
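
One way to see ahead of time which rows would be left with nothing is to check, before running the model, which genes have a protein embedding and which "cells" have no nonzero values among those genes. The sketch below assumes a dense matrix, and both the function name `summarize_gene_coverage` and the `embedded_genes` set are hypothetical. It does not reproduce UCE's preprocessing, so it may not account for every dropped row or column.

```python
import numpy as np


def summarize_gene_coverage(X, gene_names, embedded_genes):
    """Return (keep, empty_cells).

    keep: boolean mask over genes that appear in embedded_genes.
    empty_cells: boolean mask over rows with no nonzero value among kept genes.
    Hypothetical helper for illustration only.
    """
    X = np.asarray(X)
    gene_names = np.asarray(gene_names)
    keep = np.isin(gene_names, list(embedded_genes))
    nonzero_kept = (X[:, keep] != 0).sum(axis=1)
    empty_cells = nonzero_kept == 0
    return keep, empty_cells


# Toy data: 3 "cells" by 3 genes; the middle gene has no embedding.
X = np.array([
    [0.0, 2.0, 0.0],  # only expresses the gene without an embedding
    [1.0, 0.0, 3.0],
    [0.0, 0.0, 4.0],
])
gene_names = ["GENE_A", "LNC_B", "GENE_C"]
embedded_genes = {"GENE_A", "GENE_C"}  # hypothetical reference set

keep, empty_cells = summarize_gene_coverage(X, gene_names, embedded_genes)
print("genes kept:", int(keep.sum()), "of", len(gene_names))
print("cells left with no genes:", int(empty_cells.sum()))
```

With a real AnnData object, the same check could be run on adata.X (converted to dense or adapted for sparse matrices) and adata.var_names.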

@bschilder Regarding your last message, I had the same problem with genes being dropped in the resulting embedded adata, even with --filter False. It seems that the additional_filter argument in data_utils/process_raw_anndata is always True, regardless of the value passed in the args. I am not sure where exactly the problem lies, but (if needed) you can modify the function and set additional_filter=False before the gene filtering step.
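
One common way a boolean flag ends up always True in Python command-line tools is declaring it with `type=bool` in argparse, because `bool("False")` is `True` for any non-empty string. Whether UCE parses `--filter` this way has not been confirmed here, so treat the following as a sketch of a possible cause rather than a diagnosis. The `str2bool` helper is hypothetical.

```python
import argparse


def str2bool(value):
    """Parse common textual booleans (hypothetical helper)."""
    lowered = str(value).strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


# Pitfall: type=bool calls bool("False"), and any non-empty string is truthy.
parser_naive = argparse.ArgumentParser()
parser_naive.add_argument("--filter", type=bool, default=True)
naive = parser_naive.parse_args(["--filter", "False"])

# Workaround: parse the string explicitly.
parser_fixed = argparse.ArgumentParser()
parser_fixed.add_argument("--filter", type=str2bool, default=True)
fixed = parser_fixed.parse_args(["--filter", "False"])

print("type=bool     ->", naive.filter)
print("type=str2bool ->", fixed.filter)
```

If the flag turns out to be parsed this way, that would be consistent with the filter staying on regardless of the value passed.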