immunogenomics/cna

samplem conceptual questions

JBreunig opened this issue · 2 comments

I've greatly enjoyed using Harmony and hope to employ Symphony and CNA moving forward. So thank you in advance for help!

Currently, I'm stymied. I have two scRNA-seq samples from mouse brain, one young and one aged, that largely overlap but appear to have clear differences in cluster membership and gene expression. After running CNA, however, I get "no neighborhoods were significant at FDR < 0.05", and I want to make sure nothing is wrong in my code.

I have a few conceptual questions as I troubleshoot:

  1. Can CNA be used for just two samples or do I need more?

  2. I use my leiden cluster number as 'id'...is this appropriate? (This was based on my understanding of the toy data).

  3. Specifically, when I compare the toy data samplem to my data's samplem, I see decimal values in my data. Is this expected? (batchd and sample_labeld are effectively equivalent, as they are '0' or '1' depending on whether the cell comes from the young or the old sample.)

e.samplem.head(10)
Out[52]:
      batchd  sample_labeld
id
10  0.403175       0.403175
0   0.440617       0.440617
1   0.451058       0.451058
3   0.446673       0.446673
20  0.013333       0.013333
6   0.537338       0.537338
7   0.434397       0.434397
4   0.427083       0.427083
5   0.429498       0.429498
2   0.411871       0.411871
Comparison code
res = cna.tl.association(e,  # dataset
                         e.samplem.batchd,  # sample-level attribute of interest (case/control status)
                         None,  # covariates to control for (none here)
                         None)

d.samplem.head(10)
Out[58]:
    case  male  batch
id
0    0.0   0.0    0.0
1    0.0   0.0    1.0
2    0.0   0.0    2.0
3    0.0   0.0    3.0
4    0.0   0.0    4.0
5    0.0   0.0    0.0
6    0.0   0.0    1.0
7    0.0   0.0    2.0
8    0.0   0.0    3.0
9    0.0   0.0    4.0

Hi @JBreunig, thank you for your interest in CNA! I'll address your questions in order:

  1. CNA is a statistical association testing method designed for comparing samples profiled with single-cell data. Unfortunately, it is not possible to draw statistical inferences about cell states significantly associated with mouse age from a comparison of two samples (i.e. we cannot know what differences between these two mice are due to their age contrast versus due to quirks of these two specific mice). In our publication, we recommend that CNA only be applied to datasets of at least 10 samples.

  2. In our tutorial, the unique identifier per sample is stored under the label 'id' and this is the label CNA looks for by default. It is possible to use a different label than 'id' and tell CNA to look for that label instead, but whatever vector you provide will be interpreted by CNA as containing one unique value per sample.

  3. d.samplem is designed to hold information about the samples in the dataset, and yes those sample-level variables can be continuous (take non-integer values). d.samplem will contain the same number of rows as there are samples in the dataset.

Best wishes for your work!

Ahhh, that greatly clarifies my misconceptions, thank you! I will try on another dataset of 14 samples.