[advice] adding new datasets to a reference dataset
Closed this issue · 2 comments
I was not sure where in the SCE community to ask but here it goes:
Is there a tool to integrate new single-cell datasets into a reference 'atlas', so as to make the reference bigger?
I understand with SingleR I can use the information from the reference and give a known identity (cluster label) to the new cells. But I would like to add the new cells so that the previous representations (UMAP) are also transferred to the new cells. Something like scVI but in R.
In our work we sort and focus on a single type of cell and spend time curating a local reference dataset. Growing the reference might be useful as more tools like Geneformer emerge.
In any case, any thoughts or advice on this would be great.
I would post this in a more general forum, such as the Bioconductor community Slack (there is a singlecell-queries channel). All I can think of is perhaps Azimuth? That said, I would caution you not to put too much trust in "projections" into UMAP space: unlike PCA, projection into a UMAP is not a well-defined mathematical operation (see e.g. https://link.springer.com/article/10.1186/s13059-023-03065-x#Abs1)
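To make the PCA point concrete, here is a minimal sketch in plain Python (not R, and not taken from any package) of why projecting new cells onto a reference PCA is well-defined: it is just a subtraction of the reference center followed by a multiplication with the reference loadings, so the reference coordinates never move. The function names, the toy matrix and the loadings are all made up for illustration.

```python
# Sketch only: hypothetical names and toy data, not a real package API.

def column_means(matrix):
    """Per-gene mean across cells (rows are cells, columns are genes)."""
    n_cells = len(matrix)
    return [sum(row[j] for row in matrix) / n_cells for j in range(len(matrix[0]))]

def project_onto_reference(cells, center, rotation):
    """Center each cell with the reference means, then multiply by the
    reference loadings (genes x PCs). Nothing is re-fitted here."""
    n_pcs = len(rotation[0])
    projected = []
    for cell in cells:
        centered = [value - mean for value, mean in zip(cell, center)]
        projected.append([
            sum(c * rotation[gene][pc] for gene, c in enumerate(centered))
            for pc in range(n_pcs)
        ])
    return projected

# Toy reference: 4 cells x 2 genes, lying exactly on a diagonal line.
reference = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
center = column_means(reference)

# For this toy data the first PC is the diagonal direction; it is written
# out by hand here (an assumption for illustration) instead of being fitted.
rotation = [[2 ** -0.5], [2 ** -0.5]]

new_cells = [[4.0, 5.0], [6.0, 7.0]]
scores = project_onto_reference(new_cells, center, rotation)
```

The same center and loadings give the same answer for any new cell, which is exactly what UMAP and t-SNE lack: their coordinates come from an iterative optimization over all points rather than from a fixed linear map.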
My 2 cents: it is risky to map new cells onto existing UMAP/t-SNE/etc. coordinates IMO.
For example, if you have a dataset with some distinct cell types A and B, a t-SNE might put the two clusters anywhere in the 2D plot. That's fine, global structure isn't well-respected by these algorithms, and we don't care. But if your new data contains some intermediate AB state that contiguously bridges A and B, the mapping procedure now needs to somehow connect A and B on the plot, which wouldn't be pretty if A and B are far apart in the current projection.
The converse is also applicable. UMAP is often claimed to be better at preserving global structure, and while I find this claim to be a little dubious, let's just assume it's true for now. If we have related but distinct cell types C and D, they might form separate clusters that are placed next to each other in the UMAP. So far so good. But if our new data contains some intermediate but still distinct type CD, there might not be a good place to put it. Literally, there is no space in the plot between C and D to form another cluster for CD. So either the mapping algorithm has to sacrifice the accurate depiction of global structure, or try to stuff CD in and create an artificial visual trajectory between C and D.
To me, it's just safer to rerun the embeddings with the combined data. You mention that you spend a lot of time annotating the dataset, but I don't see the problem; you can just re-use those annotations with the coordinates computed from the combined data. You'll just end up with an SCE with annotations (and batch numbers) in the column data, which is independent of the embeddings that you're using to visualize the data. The embeddings themselves change on a whim anyway - for example, differences in the precision of floating-point calculations between CPU architectures are enough to alter the final coordinates substantially - so I would not use them for anything quantitative.
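As a rough illustration of that last point, here is a sketch in plain Python (not R, and not the SingleCellExperiment API) of annotations living in per-cell column data that is keyed by cell ID and carried over unchanged when new cells are added. The cell IDs, labels, batch names and the `combine_column_data` helper are all hypothetical.

```python
# Sketch only: hypothetical cell IDs, labels and helper name.

def combine_column_data(reference_labels, new_cell_ids,
                        ref_batch="reference", new_batch="new"):
    """Build per-cell metadata for the combined dataset.

    Curated labels are copied over as-is; new cells start unlabeled.
    No embedding coordinates are involved at any point."""
    column_data = {}
    for cell_id, label in reference_labels.items():
        column_data[cell_id] = {"label": label, "batch": ref_batch}
    for cell_id in new_cell_ids:
        if cell_id in column_data:
            raise ValueError(f"duplicate cell ID: {cell_id}")
        column_data[cell_id] = {"label": None, "batch": new_batch}
    return column_data

# Curated annotations from the existing reference.
reference_labels = {"ref_cell1": "A", "ref_cell2": "B"}
new_cell_ids = ["new_cell1", "new_cell2"]

column_data = combine_column_data(reference_labels, new_cell_ids)

# Cells from the new batch that still need a label (e.g. from SingleR).
unlabeled = [cell for cell, meta in column_data.items() if meta["label"] is None]
```

Whatever coordinates come out of rerunning the embedding on the combined data can then be stored next to this table and replaced at any time without touching the curated labels.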