ocbe-uio/DIscBIO

Freezing on DEGanalysis

wleoncio opened this issue · 1 comments

Describe the bug

There are some circumstances that cause the DEGanalysis function to hang.

To Reproduce

  1. Download the pan_indrop_matrix_8000cells_18556genes dataset
  2. From the download directory, run R
  3. Run the following code in R (it may take a few minutes):
library(DIscBIO)
load("data/pan_indrop_matrix_8000cells_18556genes.rda")

# ==============================================================================
# Determining contants
# ==============================================================================
n_genes <- 500
K <- 3

# ==============================================================================
# Subsetting and formatting datasets
# ==============================================================================
sc_dataframe <- pan_indrop_matrix_8000cells_18556genes[, seq_len(n_genes)]
sc <- DISCBIO(sc_dataframe)

# ==============================================================================
# Performing operations
# ==============================================================================
MIinExp <- mean(rowMeans(sc_dataframe, na.rm=TRUE))
MinNumber <- round(length(sc_dataframe[1, ]) / 10)
sc <- Normalizedata(
    sc, mintotal=1000, minexpr=MIinExp, minnumber=MinNumber, maxexpr=Inf,
    downsample=FALSE, dsn=1, rseed=17000
)
sc <- FinalPreprocessing(sc, GeneFlitering="ExpF", export=FALSE, quiet=TRUE)
sc <- Clustexp(sc, cln=K, quiet=TRUE)
sc <- comptSNE(sc, rseed=15555, quiet=TRUE)

# This is the part that freezes
cdiff <- DEGanalysis(
    sc, Clustering="K-means", K=K, fdr=0.10, name="all_clusters",
    export=FALSE, quiet=FALSE, plot=FALSE, nresamp=5, nperms=10
)
  1. Observe that the code freezes with the output of DEGanalysis being:
The dataset is ready for differential expression analysis[1] "Cl2" "Cl1" "Cl3"
Number of comparisons:  6 
Estimating sequencing depths...
Resampling to get new data matrices...
perm= 1
perm= 2
perm= 3
perm= 4
perm= 5
perm= 6
perm= 7
perm= 8
perm= 9
perm= 10
Number of thresholds chosen (all possible thresholds) = 1283
Getting all the cutoffs for the thresholds...
Getting number of false positives in the permutation...
'select()' returned 1:many mapping between keys and columns
Low-regulated genes in the Cl1 in Cl2 VS Cl1
'select()' returned 1:many mapping between keys and columns
Up-regulated genes in the Cl1 in Cl2 VS Cl1
Estimating sequencing depths...
Resampling to get new data matrices...

Ctrl+C doesn't quit the function, only killing R does the trick.

Expected behavior

For a simpler set of parameters, for example n_genes <- 100 and K <- 2, the code above ends with the following output:

Up-regulated genes in the Cl1 in Cl2 VS Cl1
  Comparisons Target cluster Gene number                                   File name Gene number                                   File name
1  Cl2 VS Cl1            Cl1         477  Up-regulated-all_clustersCl1inCl2VSCl1.csv         941 Low-regulated-all_clustersCl1inCl2VSCl1.csv
2  Cl2 VS Cl1            Cl2         477 Low-regulated-all_clustersCl2inCl2VSCl1.csv         941  Up-regulated-all_clustersCl2inCl2VSCl1.csv

Moreover, the structure of cdiff is:

List of 2
 $ : chr [1:1422, 1:2] "ENSG00000005022" "ENSG00000006327" "ENSG00000008394" "ENSG00000008517" ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "DEGsE" "DEGsS"
 $ :'data.frame':       2 obs. of  6 variables:
  ..$ Comparisons   : chr [1:2] "Cl2 VS Cl1" "Cl2 VS Cl1"
  ..$ Target cluster: chr [1:2] "Cl1" "Cl2"
  ..$ Gene number   : int [1:2] 477 477
  ..$ File name     : chr [1:2] "Up-regulated-all_clustersCl1inCl2VSCl1.csv" "Low-regulated-all_clustersCl2inCl2VSCl1.csv"
  ..$ Gene number   : int [1:2] 941 941
  ..$ File name     : chr [1:2] "Low-regulated-all_clustersCl1inCl2VSCl1.csv" "Up-regulated-all_clustersCl2inCl2VSCl1.csv"

Software metainformation

  • Operating System: Ubuntu 18.04
  • R version: 4.0.0
  • DIscBIO version or commit number: 0.99.6

Fixed on caa84d9 (dev branch). Bug was caused by a Fortran-compiled function in the SAMR package. It was replaced by an equivalent (and much faster) R function. Closing.