plger/scDblFinder

scDblFinder.sample differs from the sample column specified via the `samples` argument of `scDblFinder`?

Closed this issue · 17 comments

Hi, I ran scDblFinder in "split" sample mode to detect doublets with the following code (since the data is large, I only provide the code):

set.seed(221113L)
sce_qc <- scDblFinder::scDblFinder(
    sce_raw[, !sce_raw$low_lib_size],
    clusters = TRUE, dims = 50L, 
    samples = "Sample", multiSampleMode = "split",
    returnType = "sce"
)

When I check the results, the scDblFinder.sample column seems strange:

data.frame(colData(sce_qc)) %>%
    dplyr::select(Sample, scDblFinder.sample) %>% 
    dplyr::filter(Sample != scDblFinder.sample)
# here is some output
                   Sample scDblFinder.sample
AAACCCAAGCCTCTCT-1    B4T              B16T2
AAACCCAAGTGTAGAT-1    B4T                B1T
AAACGCTGTGTATTGC-1    B4T              B14T2
AAAGTGAGTAGATCGG-1    B4T               B16U
AACAAAGGTGGATCGA-1    B4T                B1U
AACAAGAGTCTACATG-1    B4T              B14T1
AACCAACAGGTAAACT-1    B4T                B1T
AACGGGAGTGAGATCG-1    B4T              B14T2
AAGAACATCTCTCGCA-1    B4T               B12T
AAGATAGAGCCTCATA-1    B4T                B1U
AAGATAGAGTAAGACT-1    B4T                B1T
AAGATAGCAAATGGCG-1    B4T               B16U
AAGGAATGTTGAATCC-1    B4T               B12U

I don't know why they differ when I used "split" mode. According to the scDblFinder help page, "split" mode runs the whole process separately for each sample, so I think they should be the same. Is that right?

plger commented

Thanks a lot for reporting this, yes they should be the same. Fortunately the error was only in the reporting, and shouldn't have affected the doublet scores.

The error should now be fixed in the GitHub version (I would be happy if you could confirm with your dataset), and I'll push it to Bioc devel once the checks have passed.

plger commented

Hi @Yunuuuu ,
could you confirm that this solved your problem? Will close the issue if there's no answer.
Pierre-Luc

Hi, I installed the latest plger/scDblFinder using pak::pkg_install and restarted R, but the problem remains:

image

I checked the source code of the scDblFinder function, which indicates that it has been modified:

image

I tried to understand the code, but I'm not familiar with the internal function:
image

When samples is not NULL and returnType is "sce" or "full", the following code won't run in the scDblFinder function:

        if (returnType == "counts") {
            for (s in names(d)) d[[s]]$sample <- s
            return(do.call(cbind, d))
        }

plger commented

You're absolutely right, I did this too quickly... should hopefully be fixed for real in the latest push :)

plger commented

@Yunuuuu , hopefully everything is as expected now?

I'll try this again @plger

It remains here:
image

the package GithubSHA1 is here:
image

Thanks for developing this package @plger. I'll do more testing this weekend; for now I cannot find what's wrong.

plger commented

Hi @Yunuuuu, okay, now I don't get why you're having this problem, as I can't reproduce it with my toy data. Could you share a minimal example, e.g. an SCE with only the count matrix and sample ID, only 200-300 genes, perhaps subsampling the cells? (You can rename genes and remove other cell metadata if you're worried about the data.)

Is there a way to share RDS data?

plger commented

You can email it to pierre-luc.germain@hest.ethz.ch if it's <20mb, otherwise if you don't have a platform for sharing of larger files you can write me an email and I'll send you some details.
Thanks!

Hi, I have uploaded it to Google Drive, and the link has been emailed to pierre-luc.germain@hest.ethz.ch. I can confirm that this data reproduces the problem. Thanks!

[R]> set.seed(221113L)
[R]> anyDuplicated(colnames(test_data))
[1] 3466
[R]> sce_qc <- scDblFinder::scDblFinder(
         test_data,
         clusters = TRUE, dims = 50L,
         nfeatures = 2000L,
         samples = "Sample",
         multiSampleMode = "split",
         returnType = "sce"
     )
There were 26 warnings (use warnings() to see them)

[R]> data.frame(colData(sce_qc)) %>%
         dplyr::select(Sample, scDblFinder.sample) %>%
         dplyr::filter(Sample != scDblFinder.sample) %>%
         head()
                      Sample scDblFinder.sample
TTTCCTCTCAACTCTT-1   sample3            sample2
GTCAAACTCCACGAAT-1   sample3            sample1
GGTTAACCAGCGCTTG-1   sample3            sample2
AGCATCATCGGCTTGG-1.1 sample3            sample1
TGGAACTGTGACAGCA-1.1 sample3            sample1

It seems the column (cell) names matter, since I have some duplicated column names. After making them unique with colnames(test_data) <- paste0("cell_", seq_len(ncol(test_data))), the problem goes away.


[R]> colnames(test_data) <- paste0("cell_", seq_len(ncol(test_data)))
[R]> anyDuplicated(colnames(test_data)) 
[1] 0
[R]> set.seed(221113L)

[R]> sce_qc <- scDblFinder::scDblFinder(
         test_data,
         clusters = TRUE, dims = 50L,
         nfeatures = 2000L,
         samples = "Sample",
         multiSampleMode = "split",
         returnType = "sce"
     )
There were 28 warnings (use warnings() to see them)

[R]> # logNormCounts
     data.frame(colData(sce_qc)) %>%
         dplyr::select(Sample, scDblFinder.sample) %>%
         dplyr::filter(Sample != scDblFinder.sample) %>%
         head()
[1] Sample             scDblFinder.sample
<0 rows> (or 0-length row.names)
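
For anyone hitting the same thing, here is a minimal base-R sketch of one way duplicated cell names can produce this kind of mismatch. It assumes the per-sample results are matched back to the original object by cell name; this is an illustration with made-up toy values, not the actual scDblFinder internals.

```r
# Toy data: two samples sharing one barcode (hypothetical values).
cells <- data.frame(
    name   = c("AAAC-1", "AAAG-1", "AAAC-1", "AACC-1"),
    Sample = c("sample1", "sample1", "sample2", "sample2"),
    stringsAsFactors = FALSE
)
anyDuplicated(cells$name)  # index of the first duplicated name (0 if none)

# Per-sample results keyed by cell name, as a split-and-recombine
# workflow might produce them (assumption, for illustration only).
res <- data.frame(name = cells$name, res_sample = cells$Sample,
                  stringsAsFactors = FALSE)

# match() returns only the FIRST position of each name, so the second
# "AAAC-1" (from sample2) picks up the row that belongs to sample1.
idx <- match(cells$name, res$name)
cells$res_sample <- res$res_sample[idx]
mismatch <- cells[cells$Sample != cells$res_sample, ]
print(mismatch)

# With unique names, the same lookup is unambiguous.
cells$name <- paste0("cell_", seq_len(nrow(cells)))
res$name <- cells$name
cells$res_sample <- res$res_sample[match(cells$name, res$name)]
stopifnot(all(cells$Sample == cells$res_sample))
```

Any name-based lookup of this kind silently resolves a duplicated name to its first occurrence, which would be consistent with the sample labels appearing shuffled while the per-sample computation itself is unaffected.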

plger commented

Ok, thanks @Yunuuuu , that explains a lot.
I'm afraid I'm going to have to throw an error msg on duplicated colnames, because I need to match the cells with the original object (to provide the full original object with added slots).
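
A rough sketch of what such a guard could look like on the user side, together with one way to make names unique before calling scDblFinder. The helper name `check_unique_colnames` is hypothetical and not part of the package, and the toy matrix and sample labels are made up; prefixing with the sample ID assumes barcodes are unique within each sample.

```r
# Hypothetical helper (not part of scDblFinder): fail early when column
# names are missing or duplicated, since results could not be matched
# back to the original cells reliably.
check_unique_colnames <- function(x) {
    cn <- colnames(x)
    if (is.null(cn) || anyDuplicated(cn) > 0L) {
        stop("Column names must be present and unique; ",
             "consider prefixing them with the sample ID.")
    }
    invisible(TRUE)
}

# Toy count matrix: 3 genes x 4 cells, one barcode shared by two samples.
counts <- matrix(1:12, nrow = 3,
                 dimnames = list(paste0("gene", 1:3),
                                 c("AAAC-1", "AAAG-1", "AAAC-1", "AACC-1")))
samples <- c("sample1", "sample1", "sample2", "sample2")

# The guard rejects the duplicated names ...
guard_failed <- inherits(try(check_unique_colnames(counts), silent = TRUE),
                         "try-error")

# ... and passes once each barcode is prefixed with its sample ID.
colnames(counts) <- paste(samples, colnames(counts), sep = "_")
check_unique_colnames(counts)
```

Prefixing with the sample ID keeps the original barcode recoverable, unlike replacing the names with a running index.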

@plger Thanks a lot, enforcing unique colnames has already solved this.