MarioniLab/miloR

speed up calcNhoodDistance

zktuong opened this issue ยท 8 comments

Hi,

it seems that the bottleneck for this function is in the sapply function.

the following function(s) should enable parallelization:

# mc version for sapply
mcsapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) {
  FUN <- match.fun(FUN)
  answer <- parallel::mclapply(X = X, FUN = FUN, ...)
  if (USE.NAMES && is.character(X) && is.null(names(answer))) 
    names(answer) <- X
  if (!isFALSE(simplify) && length(answer)) 
    simplify2array(answer, higher = (simplify == "array"))
  else answer
}

# with progressbar
pbmcsapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) {
  FUN <- match.fun(FUN)
  answer <- pbmcapply::pbmclapply(X = X, FUN = FUN, ...)
  if (USE.NAMES && is.character(X) && is.null(names(answer))) 
    names(answer) <- X
  if (!isFALSE(simplify) && length(answer)) 
    simplify2array(answer, higher = (simplify == "array"))
  else answer
}

so it will look like this if used:
line 65

if(any(names(reducedDims(x)) %in% c("PCA"))){
        nhood.dists <- pbmcsapply(1:ncol(nhoods(x)),
                              function(X) .calc_distance(reducedDim(x, "PCA")[non.zero.nhoods[non.zero.nhoods[,'col']==X,'row'], c(1:d),drop=FALSE]), mc.cores = parallel::detectCores())
        names(nhood.dists) <- nhoodIndex(x)

    } else if(is.character(reduced.dim)){
        nhood.dists <- pbmcsapply(1:ncol(nhoods(x)),
                              function(X) .calc_distance(reducedDim(x, reduced.dim)[non.zero.nhoods[non.zero.nhoods[,'col']==X,'row'], c(1:d),drop=FALSE]), mc.cores = parallel::detectCores())
        names(nhood.dists) <- nhoodIndex(x)
    }

Parallelizing across 24 cores for 200k cells says it will take 40min ... ? Though I'm not sure how long it normally runs otherwise.

I can put together a PR for this as well and you can check to see if the output is correct.

Hi Kelvin - could you hold off on this please. In general, there can be a large memory cost when using parallelised apply functions that will need extensive checking, likewise for the additional overhead of multiple processes.

If you want to make a new branch and do some bench-marking then please feel free and report back here.

I agree with @MikeDMorgan and I'll have a think about an efficient refactoring before going down the parallelization route, this piece of code looks like it could be slimmed down...

Ah ok. yea the 40 min was too optimistic... slowed down considerably now. I'll have to rethink this.

Hi all, bumping this as I am also finding that this step makes Milo horribly slow for with larger datasets... (I've been waiting for 30 minutes now with 7 of 8 cores sitting idle.)

Could there be a way to implement this without memory checking, but with a suitable warning?
e.g. allow a BPPARAM argument, that is only permitted to be parallel if you also set accept_all_memory_risk = TRUE? ๐Ÿ˜‰
Or whatever is a sensible approach.

Thanks!
Will

There is no longer any need to calculate distances - the recommendation is to use refinement_scheme="graph" in makeNhoods and add fdr.weighting="graph-only" to testNhoods. This uses an update graph-based approximation and dispenses with distance calculations. A manuscript is in development to describe this further, but the functionality is included in the current Bioc release (Bioc 3.16).

Hey, trying this out and fdr.weighting = "graph-only" doesn't seem to be an option for testNhoods. Could this be a typo for graph-overlap?

Thanks

sessionInfo():

R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.6 (Ootpa)

Matrix products: default
BLAS/LAPACK: /apps/rocs/2020.08/cascadelake/software/OpenBLAS/0.3.9-GCC-9.3.0/lib/libopenblas_skylakexp-r0.3.9.so

locale:
 [1] LC_CTYPE=C                 LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] patchwork_1.1.2             scater_1.26.1
 [3] miloR_1.6.0                 fgsea_1.24.0
 [5] ggrepel_0.9.3               ggbeeswarm_0.7.1

@wmacnair yes, sorry it should be "graph-overlap"

Great, thanks for the quick confirmation :)