Generally high tumor proportion from TCGA data

Question

SBaek613 opened this issue 2 years ago · 1 comments

Hi, again.

I was able to solve issues with running BayesPrism thanks to your help.

I have now been using both CIBERSORTx and BayesPrism to analyze various TCGA datasets with a single-cell matrix of my own.

The most striking difference between the tools was that BayesPrism reported a very high proportion of tumor cells (70-90%), while CIBERSORTx usually gave 20-30% for the same samples and single-cell reference.

I tried to rescale the non-tumor cells by removing the tumor proportion and scaling each sample's remaining proportions to sum to 1. However, with other CD45- cells such as fibroblasts and endothelial cells present, I was unable to recover immune cell proportions: most immune cell types had proportions of around 10^-6 to 10^-3. I could have removed all CD45- cell types, but with such low proportions of CD45+ cell types there was too much fluctuation between samples.

While the actual tumor cell proportion might vary between samples and tumor types, I would think that it is probably not as high as ~80% but also not as low as ~25%. In your paper I observed a similar pattern of high tumor cell proportions. I am curious about your interpretation of why different deconvolution tools give such a wide range of tumor cell proportions.

I am using fairly detailed cell type annotations for immune cells. Maybe that is why it was difficult to compare their proportions between tumor types (with many outliers and fluctuations)? I would appreciate any comments or general feedback. Thanks!

Answer 1 · 2022-06-18T00:30:41.000Z

Thank you for your feedback. A few potential reasons are as follows.

First, the fraction inferred by BayesPrism represents the fraction of reads (rather than the cell count) contributed by each cell type in each bulk sample. As a result, cells with a low total transcription level will have a lower fraction of reads. Tumor cells may have a higher amount of total transcription than other cells, such as T cells, which may contribute to the seemingly overestimated tumor fractions. CIBERSORTx, on the other hand, builds a signature matrix and performs deconvolution over the signature genes only, so the fraction it infers is defined over the selected signature genes, which may also contribute to the difference between the two methods. You may also try running BayesPrism over the signature genes selected by CIBERSORTx and then compare the results. That being said, when we compared BayesPrism and CIBERSORTx with the tumor purity estimated by other methods, including IHC, ABSOLUTE and ESTIMATE, we did not seem to detect systematic overestimation for the cancer types we tested (see Supplementary Fig. 2 of the paper).
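To make the read-versus-cell distinction concrete, here is a minimal Python sketch. It is not part of BayesPrism: the function name and the per-cell transcript counts are hypothetical, and it assumes you have some independent estimate of the mean number of transcripts per cell for each cell type (for example, mean UMI counts from the single-cell reference, which is itself only a rough proxy).

```python
def reads_to_cell_fractions(read_fractions, transcripts_per_cell):
    """Convert read fractions to approximate cell fractions.

    read_fractions: dict mapping cell type -> fraction of bulk reads.
    transcripts_per_cell: dict mapping cell type -> assumed mean number of
        transcripts per cell (a hypothetical, user-supplied estimate).
    """
    # Number of cells is proportional to reads divided by reads-per-cell.
    weights = {
        ct: read_fractions[ct] / transcripts_per_cell[ct]
        for ct in read_fractions
    }
    total = sum(weights.values())
    return {ct: w / total for ct, w in weights.items()}


# Made-up numbers for illustration only.
read_fractions = {"tumor": 0.8, "T cell": 0.1, "fibroblast": 0.1}
transcripts_per_cell = {"tumor": 20000, "T cell": 2500, "fibroblast": 10000}

cell_fractions = reads_to_cell_fractions(read_fractions, transcripts_per_cell)
for cell_type, fraction in cell_fractions.items():
    print(f"{cell_type}: {fraction:.3f}")
```

With these made-up numbers, a tumor read fraction of 0.8 corresponds to a cell fraction of about 0.44, because each tumor cell is assumed to contribute eight times as many transcripts as a T cell.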

The second potential cause is that when the non-tumor cells in the reference are too few, they will have a sparser representation than the tumor cells, so that bulk reads will be assigned to tumor for those genes with zero expression in the non-tumor cells. We also observed a similar effect for T cells in GBM (see Supplementary Fig. 1e of our BayesPrism paper). Under such circumstances, although the absolute fraction will be underestimated for cell types with too few cells, the relative fractions are still accurate. We recommend that users represent each cell type with a sufficient number of cells, say >20 or even >50.
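The two checks suggested by this point can be sketched in Python as follows. This is an illustrative sketch, not BayesPrism code: both function names are hypothetical, the cell type labels are assumed to be a plain list with one entry per reference cell, and the threshold of 20 is taken from the recommendation above.

```python
from collections import Counter


def underrepresented_types(cell_type_labels, min_cells=20):
    """Return cell types with at most `min_cells` cells in the reference."""
    counts = Counter(cell_type_labels)
    return {ct: n for ct, n in counts.items() if n <= min_cells}


def relative_fractions(fractions, exclude=("tumor",)):
    """Drop the excluded cell types and rescale the rest to sum to 1."""
    kept = {ct: f for ct, f in fractions.items() if ct not in exclude}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no remaining fraction to rescale")
    return {ct: f / total for ct, f in kept.items()}


# Made-up reference labels and fractions for illustration only.
labels = ["tumor"] * 500 + ["T cell"] * 12 + ["fibroblast"] * 60
flagged = underrepresented_types(labels, min_cells=20)

sample_fractions = {"tumor": 0.8, "T cell": 0.05, "fibroblast": 0.15}
non_tumor = relative_fractions(sample_fractions)

print(flagged)
print(non_tumor)
```

Here the T cells are flagged because only 12 cells represent them, and the non-tumor fractions are compared on a relative scale rather than an absolute one.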

The third reason might be related to the high granularity of the cell type definitions in your reference. In one spatial transcriptomics dataset we tested, when the reference cell types were too similar/co-linear, the quality of the reference (e.g. the number of cells representing each cell type) could have a greater impact, causing the fractions of some cell types to be close to zero (due to the weak/sparse prior). In fact, high co-linearity will also cause linear regression to be unstable (higher standard errors in the regression coefficients). If that is the case, users may merge the cell types to a granularity of higher confidence, or simply treat them as cell states, which will be summed up by BayesPrism.
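Merging fine-grained cell types after the fact can be sketched in Python as below. This is a post-hoc illustration only, not how BayesPrism sums cell states internally: the function name, the cell type names and the grouping are all hypothetical, and summing estimated fractions is not guaranteed to match rerunning the deconvolution with a merged reference.

```python
def merge_cell_types(fractions, groups):
    """Sum fractions of fine-grained cell types into coarser groups.

    fractions: dict mapping fine cell type -> estimated fraction.
    groups: dict mapping fine cell type -> coarse cell type; any cell
        type not listed keeps its own name.
    """
    merged = {}
    for cell_type, fraction in fractions.items():
        coarse = groups.get(cell_type, cell_type)
        merged[coarse] = merged.get(coarse, 0.0) + fraction
    return merged


# Made-up fractions and grouping for illustration only.
fine_fractions = {
    "tumor": 0.70,
    "CD8 naive": 0.02,
    "CD8 exhausted": 0.03,
    "Treg": 0.05,
    "fibroblast": 0.20,
}
groups = {"CD8 naive": "T cell", "CD8 exhausted": "T cell", "Treg": "T cell"}

coarse_fractions = merge_cell_types(fine_fractions, groups)
for cell_type, fraction in coarse_fractions.items():
    print(f"{cell_type}: {fraction:.2f}")
```

The merged "T cell" fraction is less sensitive to how reads were split among closely related subtypes, which is where the instability from co-linearity tends to show up.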

I hope this clarifies things. Let me know if you have any other questions.

Best,

Tinyi