immunogenomics/cna

Single score for each sample

repeatpipettor opened this issue · 1 comments

Hi, thanks for a great tool. I'm interested in calculating a single score for each sample to use in downstream analyses, e.g. correlating each sample's score with their expression levels of different genes of interest. What would be the best object to use to calculate this? I imagine these would be derived from the single cells of each sample that were used for the association test. Perhaps d.uns['NAM_sampleXpc'], which contains the sample loadings of the principal components of the NAM, and subset to the PCs that were used for the association test?

Hi there, we are glad to hear you're a fan of the tool!

If you are seeking to identify genes whose expression most corresponds to a cell state implicated by CNA as associated with your sample attribute of interest, there are a few approaches you could pursue.

In the result object output by the association function (I'll call it res), you will find a vector named res.ncorrs that has values for every neighborhood included in the analysis (the retained neighborhoods are specified by the vector 'res.kept'). For each neighborhood, this res.ncorrs value represents the estimate provided by CNA of the correlation between cell abundance in that neighborhood and the sample attribute of interest. Examining which genes have per-cell expression most correlated to these res.ncorrs values was the primary interpretive strategy we used in the CNA publication when applying CNA to real single-cell datasets. In this approach, you are using the gene expression profile of the anchor cell as a representation of the expression state captured by that neighborhood.

This first approach ^^ makes use of the continuous res.ncorrs values, but you could alternatively threshold res.ncorrs at the FDR level of your choice (you can find the 5% threshold in res.fdr_5p_t), to identify a set of neighborhoods that pass the threshold and can be considered the expanded cell state (or depleted, for negative res.ncorrs values). Then, for the set of anchor cells corresponding to these neighborhoods, you can employ any strategy traditionally used to characterize a set of cells in single-cell data (e.g. Wilcoxon rank-sum tests to identify differentially-expressed genes between cells within versus outside the cell set you've defined).

Using res.ncorrs is the best way to interpret the populations implicated by CNA as associated to your sample attribute of interest. If instead you want to simply understand the cell states implicated as the major axes of inter-sample variation in your dataset, then you can explore gene expression associations to neighborhood loadings on the NAM-PCs, as stored in d.uns['NAM_nbhdXpc']. It is important to note, however, that not all top NAM-PCs may have been strongly associated in the linear model to your attribute of interest; interpreting these axes in isolation only tells you about this unsupervised dataset structure and not necessarily about CNA's model for your sample attribute of interest. You can look at res.beta to see the effects assigned to each NAM-PC in the linear model for your attribute.

Finally, if you want to use the per-sample feature set provided by CNA (i.e. the sample loadings on NAM-PCs) to do any other kind of modeling or analysis, you can access that at d.uns['NAM_sampleXpc], as you suggested. This is likely the most direct answer to your original question, but I included the earlier commentary because it sounds like you may also be seeking a biological interpretation of CNA's model for your attribute of interest, in addition to using this unsupervised data representation to do your own modeling.