scDblFinder.sample is different from the sample column specified in the `samples` argument of the `scDblFinder` function?
Closed this issue · 17 comments
Hi, I have run scDblFinder in "split" sample mode to detect doublets with the following code (since the data is large, I only provide the code):
set.seed(221113L)
sce_qc <- scDblFinder::scDblFinder(
sce_raw[, !sce_raw$low_lib_size],
clusters = TRUE, dims = 50L,
samples = "Sample", multiSampleMode = "split",
returnType = "sce"
)
When I check the results, the scDblFinder.sample column seems strange:
data.frame(colData(sce_qc)) %>%
dplyr::select(Sample, scDblFinder.sample) %>%
dplyr::filter(Sample != scDblFinder.sample)
# here is some output
Sample scDblFinder.sample
AAACCCAAGCCTCTCT-1 B4T B16T2
AAACCCAAGTGTAGAT-1 B4T B1T
AAACGCTGTGTATTGC-1 B4T B14T2
AAAGTGAGTAGATCGG-1 B4T B16U
AACAAAGGTGGATCGA-1 B4T B1U
AACAAGAGTCTACATG-1 B4T B14T1
AACCAACAGGTAAACT-1 B4T B1T
AACGGGAGTGAGATCG-1 B4T B14T2
AAGAACATCTCTCGCA-1 B4T B12T
AAGATAGAGCCTCATA-1 B4T B1U
AAGATAGAGTAAGACT-1 B4T B1T
AAGATAGCAAATGGCG-1 B4T B16U
AAGGAATGTTGAATCC-1 B4T B12U
I don't know why they are different when I used "split" mode. According to the help page of scDblFinder, "split" mode runs the whole process separately for each sample, so I think they should be the same. Is that right?
Thanks a lot for reporting this, yes they should be the same. Fortunately the error was only in the reporting, and shouldn't have affected the doublet scores.
The error should be fixed now in the GitHub version (I would be happy if you could confirm with your dataset), and I'll push it to Bioc devel once the checks have passed.
Hi @Yunuuuu ,
could you confirm that this solved your problem? Will close the issue if there's no answer.
Pierre-Luc
You're absolutely right, I did this too quickly... should hopefully be fixed for real in the latest push :)
Thanks for developing this package, @plger. I'll do more tests this weekend; I can't figure out what's wrong right now.
Hi @Yunuuuu , okay now I don't get why you're having this problem, as I can't reproduce it with my toy data. Could you share a minimal example, e.g. SCE with only count matrix and sample id, only 2-300 genes, perhaps subsampling the cells? (you can rename genes & remove other cell metadata if you're worried about the data)
Is there a way to share RDS data?
You can email it to pierre-luc.germain@hest.ethz.ch if it's <20mb; otherwise, if you don't have a platform for sharing larger files, you can write me an email and I'll send you some details.
Thanks!
Hi, I have uploaded it to Google Drive, and the link has been emailed to pierre-luc.germain@hest.ethz.ch. I can confirm that this data reproduces the problem. Thanks!
[R]> set.seed(221113L)
[R]> anyDuplicated(colnames(test_data))
[1] 3466
[R]> sce_qc <- scDblFinder::scDblFinder(
test_data,
clusters = TRUE, dims = 50L,
nfeatures = 2000L,
samples = "Sample",
multiSampleMode = "split",
returnType = "sce"
)
There were 26 warnings (use warnings() to see them)
[R]> data.frame(colData(sce_qc)) %>%
dplyr::select(Sample, scDblFinder.sample) %>%
dplyr::filter(Sample != scDblFinder.sample) %>%
head()
Sample scDblFinder.sample
TTTCCTCTCAACTCTT-1 sample3 sample2
GTCAAACTCCACGAAT-1 sample3 sample1
GGTTAACCAGCGCTTG-1 sample3 sample2
AGCATCATCGGCTTGG-1.1 sample3 sample1
TGGAACTGTGACAGCA-1.1 sample3 sample1
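A note on reading the transcript above: `anyDuplicated()` returns the index of the first duplicated element (or 0 if there is none), so the `[1] 3466` is a position, not a count of duplicated barcodes. The sketch below uses made-up toy barcodes (not taken from the shared dataset) to show how the first duplicate index, the number of repeats, and the affected names could be inspected with base R.

```r
# Toy barcodes standing in for colnames(test_data); the values are invented.
barcodes <- c("AAAC-1", "AAAG-1", "AAAC-1", "AACA-1", "AAAG-1")

# Index of the first element that repeats an earlier one (0 if none).
first_dup_index <- anyDuplicated(barcodes)

# Number of elements that repeat an earlier one.
n_repeats <- sum(duplicated(barcodes))

# The distinct names that occur more than once.
dup_names <- unique(barcodes[duplicated(barcodes)])

print(first_dup_index)
print(n_repeats)
print(dup_names)
```

On a real object, the same calls would be applied to `colnames(test_data)`.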
It seems the column (cell) names matter, since I have some duplicated column names. Changing the column names with colnames(test_data) <- paste0("cell_", seq_len(ncol(test_data))) resolves the problem.
[R]> colnames(test_data) <- paste0("cell_", seq_len(ncol(test_data)))
[R]> anyDuplicated(colnames(test_data))
[1] 0
[R]> set.seed(221113L)
[R]> sce_qc <- scDblFinder::scDblFinder(
test_data,
clusters = TRUE, dims = 50L,
nfeatures = 2000L,
samples = "Sample",
multiSampleMode = "split",
returnType = "sce"
)
There were 28 warnings (use warnings() to see them)
[R]> # logNormCounts
data.frame(colData(sce_qc)) %>%
dplyr::select(Sample, scDblFinder.sample) %>%
dplyr::filter(Sample != scDblFinder.sample) %>%
head()
[1] Sample scDblFinder.sample
<0 rows> (or 0-length row.names)
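As a general illustration of why duplicated cell names can produce this kind of mismatch, the base R sketch below shows that name-based lookup returns the first match for every occurrence of a repeated name. Whether this is the exact mechanism inside scDblFinder is an assumption; the thread only establishes that duplicated column names trigger the problem. The barcodes and sample labels are invented toy values. The sketch also shows `make.unique()` as an alternative to renaming every cell, since it keeps the original barcode visible.

```r
# Toy per-cell sample labels keyed by barcode; "AAAC-1" occurs in two samples.
barcodes <- c("AAAC-1", "AAAG-1", "AAAC-1")
sample_of <- setNames(c("sample1", "sample2", "sample3"), barcodes)

# Character indexing matches the first element with that name,
# so the third cell comes back as sample1 instead of sample3.
looked_up <- unname(sample_of[barcodes])
print(looked_up)

# make.unique() appends .1, .2, ... to repeated names only.
unique_names <- make.unique(barcodes)
names(sample_of) <- unique_names

# With unique names, each cell maps back to its own label.
fixed <- unname(sample_of[unique_names])
print(unique_names)
print(fixed)
```

Applied to the data in this thread, the equivalent step would be `colnames(test_data) <- make.unique(colnames(test_data))` before calling `scDblFinder()`.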