HelenaLC/muscat

muscat_analysis don't accept DICE annotation

Closed this issue · 15 comments

Hello,
I've annotated my sce with BlueprintDB database in SingleR and it's working.
When I try to run muscat_analysis on the same sce annotated with DICE database, muscat stop the analysis during calcExprFreqs.
But I'm able to run "manually" get_DE-info().
I've fully compare the two databases but I don't find the bug.

Error Message :
Error in rownames<-(*tmp*, value = all_genes) :
attempt to set 'rownames' on an object with no dimensions
In addition: Warning message:
In FUN(X[[i]], ...) :
For Celltype NA gene names got lost while calculating the fraction of expression of each gene with muscat::calcExprFreqs (due to a bug in this function). We temporality fixed this ourselves for the moment.

Any idea?

Best regards,
Olivier

Please provide a reproducible example, otherwise this is hard to understand and I can not help… also I’m struggling with understanding a couple sentences. E.g., what do you mean by “for celltype NA gene names got lost”? Have you tried setting these to something like “untitled”?

I just replace the NA gene names by "unknown" and now I've this error in:

Capture

rem: i'm very new here, is it possible to put big file as example in GitHub?

NA gene name? I thought you said celltype name? …please be specific. - no, I didn’t mean uploading the data. But putting line by line the code leading up the error as well as relevant intermediate outputs that might help understand what’s causing the issue.

Here is an abstract of the code I use:

sce <- as.SingleCellExperiment(seuratObject, assay="SCT")
refDICE <- DatabaseImmuneCellExpressionData()
SubDICE <- SingleR(test=sce, ref=refDICE, labels=refDICE$label.fine)
sce@colData$annotationSubDICE<-SubDICE$pruned.labels
sce$annotationSubDICE[is.na(sce$annotationSubDICE)]<-"unknown"
sce$treatment <- dplyr::recode(sce$SampleID,
  ABE157 = "Im1wt", ABE158 = "Im1wt", ABE159 = "Im1no", ABE160 = "Im1no", ABE161 = "Im1Im1", ABE162 = "Im1Im1",
  ABE163 = "wtwt", ABE164 = "wtwt", ABE165 = "wtno", ABE166 = "wtno", ABE167 = "wtIm1", ABE168 = "wtIm1",
  ABE169 = "Im2wt", ABE170 = "Im2wt", ABE171 = "Im2no", ABE172 = "Im2no", ABE173 = "Im2Im2", ABE174 = "Im2Im2"
)
contrasts_oi <- c("'Im1Im1-(Im1wt+Im1no)/2','Im2Im2-(Im2no+Im2wt)/2','wtwt-(wtIm1+wtno)/2'")
contrast_tbl <- data.frame(contrast = c("Im1Im1-(Im1wt+Im1no)/2","Im2Im2-(Im2no+Im2wt)/2","wtwt-(wtIm1+wtno)/2"), group = `c("Im1Im1","Im2Im2","wtwt"))`
cluster_id="annotationSubDICE"
sample_id="SampleID"
group_id="treatment"
muscat <- muscat_analysis(
     sce = sce,
     celltype_id = celltype_id,
     sample_id = sample_id,
     group_id = group_id,
     covariates = NA,
     contrasts_oi = contrasts_oi,
     contrast_tbl = contrast_tbl,
     assay_oi_pb = "counts",
     fun_oi_pb = "sum",
     de_method_oi = "edgeR",
     min_cells = 10,
     verbose=TRUE)

When I don't replace the NA celltype name with "unknown", I got this message:

"[1] "Calculate expression information"
Error in rownames<-(*tmp*, value = all_genes) :
attempt to set 'rownames' on an object with no dimensions
In addition: Warning message:
In FUN(X[[i]], ...) :
For Celltype NA gene names got lost while calculating the fraction of expression of each gene with muscat::calcExprFreqs (due to a bug in this function). We temporality fixed this ourselves for the moment."

After replacement of the NA celltype with "unknown", I got this message:

Capture

When I do exactly the same with the BlueprintDB database, it works perfectly well.

muscat_analysis is not part of the muscat package, so I have no idea what it is doing. If you wrote it as a wrapper, please provide the code that is actually being run.

I'm sorry for this misunderstanding. Anyway, thank you for your help. I will ask to the muscatWrapper developer.
Best regards,
Olivier

Yes, agreed. I wasn’t aware there was sich a package… I’m curious though: Why use a muscat wrapper when it’s a couple lines to run the analysis yourself? Should be easier to also understand any issues that might occur… I don’t see the benefit here, because muscat is already a wrapper around pseudobulking+edgeR (for example).

I'm sorry to insist but I would like to understand what happen so I redo the analysis with your pipeline.
I obtain exactly the same "pb" objects with two differents celltype annotations (group_id).
When I run pbMDS() with one celltype annotation I got this error message:

Error in if (median(f75) < 1e-20) { :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In DGEList(unname(y), remove.zeros = TRUE) : library size of zero detected

Why I've a library with zero size from on celltype annotation and not with the others?

sce <- as.SingleCellExperiment(seuObjFull, assay="SCT")
sceDICE <- prepSCE(sce1, kid = "annotationSubDICE",gid = "treatment",sid = "SampleID",drop = TRUE)
nkDICE <- length(kidsDICE <- levels(sceDICE$cluster_id))
nsDICE <- length(sidsDICE <- levels(sceDICE$sample_id))
names(kidsDICE) <- kidsDICE; names(sidsDICE) <- sidsDICE
sceBlue <- prepSCE(sce1, kid = "annotation",gid = "treatment",sid = "SampleID",drop = TRUE)
nkBlue <- length(kidsBlue <- levels(sceBlue$cluster_id))
nsBlue <- length(sidsBlue <- levels(sceBlue$sample_id))
names(kidsBlue) <- kidsBlue; names(sidsBlue) <- sidsBlue

# # Data overview
t(table(sceDICE$cluster_id, sce$sample_id))
t(table(sceBlue$cluster_id, sce$sample_id))

pbDICE <- aggregateData(sceDICE,
    assay = "counts", fun = "sum",
    by = c("cluster_id", "sample_id"))
> pbDICE
class: SingleCellExperiment
dim: 26345 18
metadata(2): experiment_info agg_pars
assays(13): B cells, naive NK cells ... T cells, CD8+, naive,
  stimulated unknown
rownames(26345): AL627309.1 AL627309.5 ... AC007325.4 AC007325.2
rowData names(0):
colnames(18): ABE157 ABE158 ... ABE173 ABE174
colData names(1): group_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
> pbMDS(pbDICE)
Error in if (median(f75) < 1e-20) { :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In DGEList(unname(y), remove.zeros = TRUE) : library size of zero detected

pbBlue <- aggregateData(sceBlue,
    assay = "counts", fun = "sum",
    by = c("cluster_id", "sample_id"))
> pbBlue
class: SingleCellExperiment
dim: 26345 18
metadata(2): experiment_info agg_pars
assays(15): CD4+ T-cells CD4+ Tcm ... Plasma cells Tregs
rownames(26345): AL627309.1 AL627309.5 ... AC007325.4 AC007325.2
rowData names(0):
colnames(18): ABE157 ABE158 ... ABE173 ABE174
colData names(1): group_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
  • Zero library sizes mean that some cluster-sample combinations do not express any features. Different annotations will not give the exact same pseudobulks (pb), so it is not impossible this might happen.
  • Could you perhaps print the output of table(sce$cluster_id, sce$sample_id) for the different annotations - maybe there are missing combinations? Also worth checking the library sizes to see where the evil lies, e.g., via sapply(assays(pb), \(.) range(colSums(.))) for both pbBlue and pbDICE.

(*To avoid confusion: you said "two different celltype annotations group_id". But in the context of this analysis, group_id = experimental condition and cluster_id = cell type / subpopulation annotation.)

*to avoid confusion: group_id = experimental conditions and cluster_id = cellType annotation (from "Blue" and "DICE" database).

>t(table(sceBlue$cluster_id, sceBlue$sample_id))
Blue	CD4+ T-cells	CD4+ Tcm	CD4+ Tem	CD8+ T-cells	CD8+ Tcm	CD8+ Tem	CLP	CMP	DC	Erythrocytes	GMP	MEP	NK cells	Plasma cells	Tregs
ABE157	2	25	4755	0	119	6	0	0	17	0	0	0	3	0	61
ABE158	4	36	5503	0	149	10	0	0	13	0	0	0	2	0	75
ABE159	3	26	2957	0	101	13	0	0	3	0	0	0	2	0	30
ABE160	0	11	1728	0	57	2	0	0	1	0	0	0	1	0	9
ABE161	0	29	3977	0	172	3	0	0	26	3	0	1	2	1	1020
ABE162	0	22	6838	0	292	2	0	0	31	2	0	1	0	0	1537
ABE163	1	50	4273	0	152	5	0	0	14	11	1	6	1	4	883
ABE164	0	49	4658	0	180	4	0	0	29	10	1	4	0	0	885
ABE165	1	15	6878	0	205	10	0	0	5	0	0	0	0	0	87
ABE166	0	22	6554	0	149	4	0	0	4	0	0	0	0	0	63
ABE167	0	34	7230	0	139	6	0	0	26	7	1	2	0	0	446
ABE168	1	27	7148	0	124	7	0	0	13	6	0	4	0	0	380
ABE169	2	65	3190	1	126	3	0	1	48	22	1	2	0	0	440
ABE170	3	65	4192	0	199	1	0	2	41	18	0	6	0	2	548
ABE171	1	54	6428	0	110	1	0	0	40	5	0	6	0	1	150
ABE172	3	46	4788	0	55	2	1	0	26	5	0	1	0	0	108
ABE173	4	101	4553	1	248	5	1	0	22	19	2	6	3	0	622
ABE174	7	101	4354	0	224	4	0	0	15	18	1	4	0	1	536
>t(table(sceDICE$cluster_id, sceDICE$sample_id))
DICE	B cells, naïve	NK cells	T cells, CD4+, memory TREG	T cells, CD4+, naïve stimulated	T cells CD4+, naïve	T cells, CD4+, TFH	T cells, CD4+, Th1	T cells, CD4+, Th1_17	T cells, CD4+, Th17	T cells, CD4+, Th2	T cells, CD8+, naïve	T cells, CD8+, naïve stimulated
ABE157	0	42	2041	0	106	8	1415	26	210	1040	0	89
ABE158	0	52	2378	1	112	7	1850	27	212	1038	0	99
ABE159	0	42	1139	0	52	9	984	18	164	691	0	29
ABE160	0	16	643	0	36	0	581	17	122	372	0	21
ABE161	0	45	4177	0	91	6	725	3	14	46	0	122
ABE162	0	77	6808	0	189	5	1297	13	13	61	0	240
ABE163	0	0	3329	0	635	1	1143	15	30	23	1	219
ABE164	0	2	3437	0	1009	1	1067	15	35	24	0	226
ABE165	0	1	2327	0	189	4	4305	81	72	78	0	141
ABE166	1	1	2256	0	166	10	3950	89	105	84	0	131
ABE167	0	5	3816	0	1361	1	2298	33	20	28	0	325
ABE168	0	4	3537	0	1301	0	2416	31	28	27	0	360
ABE169	0	4	2117	0	583	2	964	11	1	15	0	203
ABE170	0	0	2902	0	637	9	1275	16	5	12	0	218
ABE171	0	5	2972	0	710	7	2782	59	16	42	0	187
ABE172	0	9	2024	0	621	4	2091	60	22	49	0	137
ABE173	0	1	2418	0	1840	9	936	23	7	7	0	336
ABE174	0	1	2198	0	1919	8	807	20	7	11	0	279
> str(lapply(assays(pbBlue), \(.) range(colSums(.))))
List of 15
 $ CD4+ T-cells: num [1:2] 0 89319
 $ CD4+ Tcm    : num [1:2] 136239 1234767
 $ CD4+ Tem    : num [1:2] 21289702 84754447
 $ CD8+ T-cells: num [1:2] 0 12653
 $ CD8+ Tcm    : num [1:2] 635327 3367515
 $ CD8+ Tem    : num [1:2] 11002 144722
 $ CLP         : num [1:2] 0 12936
 $ CMP         : num [1:2] 0 24806
 $ DC          : num [1:2] 12880 608048
 $ Erythrocytes: num [1:2] 0 277991
 $ GMP         : num [1:2] 0 25527
 $ MEP         : num [1:2] 0 77227
 $ NK cells    : num [1:2] 0 35863
 $ Plasma cells: num [1:2] 0 50804
 $ Tregs       : num [1:2] 114381 18990948
> str(lapply(assays(pbDICE), \(.) range(colSums(.))))
List of 12
 $ B cells, naive                  : num [1:2] 0 0
 $ NK cells                        : num [1:2] 0 801166
 $ T cells, CD4+, memory TREG      : num [1:2] 7993607 81680996
 $ T cells, CD4+, naive            : num [1:2] 0 0
 $ T cells, CD4+, naive, stimulated: num [1:2] 454821 23211410
 $ T cells, CD4+, TFH              : num [1:2] 0 111842
 $ T cells, CD4+, Th1              : num [1:2] 7064002 48864037
 $ T cells, CD4+, Th1_17           : num [1:2] 31132 968705
 $ T cells, CD4+, Th17             : num [1:2] 11144 2486473
 $ T cells, CD4+, Th2              : num [1:2] 77373 12475597
 $ T cells, CD8+, naive            : num [1:2] 0 0
 $ T cells, CD8+, naive, stimulated: num [1:2] 261069 4276299

So it's a little more clear. If I understand, I need to remove the cellType 1, 4, & 11 because there is only 1 cells in these cellTypes.
But, just for my understanding, how is it possible to have no gene expressed while I have at least 1 cells?

Thanks again for your help.
Olivier

Hello,
Finally, I choose 4 cellType with enough cells (the "keep" ones) to run pbDS analysis.
But pbDS() excluded 2 of these cellTypes (1&4 in the list above are rejected) while I have enough cells to run DE analysis.
Could you tell me why?

Thanks a lot

> str(lapply(assays(pbDICE), \(.) range(colSums(.))))
List of 4
 $ T_cells_CD4+_memory_TREG     : num [1:2] 7993607 81680996
 $ T_cells_CD4+_naive_stimulated: num [1:2] 454821 23211410
 $ T_cells_CD4+_Th1             : num [1:2] 7064002 48864037
 $ T_cells_CD8+_naive_stimulated: num [1:2] 261069 4276299
	ABE157	ABE158	ABE159	ABE160	ABE161	ABE162	ABE163	ABE164	ABE165	ABE166	ABE167	ABE168	ABE169	ABE170	ABE171	ABE172	ABE173	ABE174
	NK_cells	42	52	42	16	45	77	0	2	1	1	5	4	4	0	5	9	1	1
Keep	T_cells_CD4+_memory_TREG	2041	2378	1139	643	4177	6808	3329	3437	2327	2256	3816	3537	2117	2902	2972	2024	2418	2198
Keep	T_cells_CD4+_naive_stimulated	106	112	52	36	91	189	635	1009	189	166	1361	1301	583	637	710	621	1840	1919
	T_cells_CD4+_TFH	8	7	9	0	6	5	1	1	4	10	1	0	2	9	7	4	9	8
Keep	T_cells_CD4+_Th1	1415	1850	984	581	725	1297	1143	1067	4305	3950	2298	2416	964	1275	2782	2091	936	807
	T_cells_CD4+_Th1_17	26	27	18	17	3	13	15	15	81	89	33	31	11	16	59	60	23	20
	T_cells_CD4+_Th17	210	212	164	122	14	13	30	35	72	105	20	28	1	5	16	22	7	7
	T_cells_CD4+_Th2	1040	1038	691	372	46	61	23	24	78	84	28	27	15	12	42	49	7	11
Keep	T_cells_CD8+_naive_stimulated	89	99	29	21	122	240	219	226	141	131	3
contrastDICE1 <- makeContrasts(c("Im1Im1-(Im1wt+Im1no)/2"), levels = c("Im1Im1","Im1wt","Im1no","Im2Im2","Im2no","Im2wt","wtwt","wtIm1","wtno"))
contrastDICE2 <- makeContrasts(c("Im2Im2-(Im2no+Im2wt)/2"), levels = c("Im1Im1","Im1wt","Im1no","Im2Im2","Im2no","Im2wt","wtwt","wtIm1","wtno"))
contrastDICE3 <- makeContrasts(c("wtwt-(wtIm1+wtno)/2"), levels = c("Im1Im1","Im1wt","Im1no","Im2Im2","Im2no","Im2wt","wtwt","wtIm1","wtno"))
contrastDICE<-cbind(contrastDICE1,contrastDICE2,contrastDICE3)

DEDICE<-pbDS(pbDICE, design = mmDICE, contrast = contrastDICE)

It's really hard to grasp the data from what you're posting... But I'd say that from the cell counts by cluster-sample you cannot say directly whether or not you have enough cells to run differential analyses. This depends on how many cells there are per group or, otherwise put, how many samples have sufficiently many cells in each group. At the very least, two samples per group (depending on min_samples) need to have enough cells (depending on min_cells). Then, depending on filtering, these also have non-zero counts for some genes. Overall, it's not as simple as saying "I have enough cells". In addition, I would advice to caution: Even if there are just enough cells, I wonder whether you should "trust" results from testing a hand full of cells across many samples... Instead, a better approach might be do lower the resolution, i.e., merge some biologically similar subpopulation into broader groups such that they have a decent number of cells across all/most samples. Hope that all makes sense?

So even if I have enough cells/samples in a cell cluster, I should consider that no pbDS output for that cluster could be due to lack of differential gene expression rather than lack of cells .

No, you will get results per gene (p-values, FDR, logFCs) independent of whether or not there's anything differential. But a cluster not being tested (though it has many cells in total) could indicate that there are not enough cells per sample per group... As I said: There need to be at least 2 samples in each group with at least N cells that express a gene.

I've read some documentation on prepSIM and edgeR
In the muscatWrapper the settings for pb::edegeR are:
filterByExpr.min.count = 7, filterByExpr.min.total.count = 15, filterByExpr.large.n = 4, filterByExpr.min.prop = 0.7

And in prepSIM:
min_count = 1,
min_cells = 10,
min_genes = 100,
min_size = 100,

So now it make senses why I'm losing cluster (cell subpopulation).

Thanks again