speed up calcNhoodDistance

Question

speed up calcNhoodDistance

zktuong opened this issue 4 years ago · 8 comments

Hi,

it seems that the bottleneck for this function is in the sapply function.

the following function(s) should enable parallelization:

# mc version for sapply
mcsapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) {
  FUN <- match.fun(FUN)
  answer <- parallel::mclapply(X = X, FUN = FUN, ...)
  if (USE.NAMES && is.character(X) && is.null(names(answer))) 
    names(answer) <- X
  if (!isFALSE(simplify) && length(answer)) 
    simplify2array(answer, higher = (simplify == "array"))
  else answer
}

# with progressbar
pbmcsapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) {
  FUN <- match.fun(FUN)
  answer <- pbmcapply::pbmclapply(X = X, FUN = FUN, ...)
  if (USE.NAMES && is.character(X) && is.null(names(answer))) 
    names(answer) <- X
  if (!isFALSE(simplify) && length(answer)) 
    simplify2array(answer, higher = (simplify == "array"))
  else answer
}

so it will look like this if used:
line 65

if(any(names(reducedDims(x)) %in% c("PCA"))){
        nhood.dists <- pbmcsapply(1:ncol(nhoods(x)),
                              function(X) .calc_distance(reducedDim(x, "PCA")[non.zero.nhoods[non.zero.nhoods[,'col']==X,'row'], c(1:d),drop=FALSE]), mc.cores = parallel::detectCores())
        names(nhood.dists) <- nhoodIndex(x)

    } else if(is.character(reduced.dim)){
        nhood.dists <- pbmcsapply(1:ncol(nhoods(x)),
                              function(X) .calc_distance(reducedDim(x, reduced.dim)[non.zero.nhoods[non.zero.nhoods[,'col']==X,'row'], c(1:d),drop=FALSE]), mc.cores = parallel::detectCores())
        names(nhood.dists) <- nhoodIndex(x)
    }

Parallelizing across 24 cores for 200k cells says it will take 40min ... ? Though I'm not sure how long it normally runs otherwise.

I can put together a PR for this as well and you can check to see if the output is correct.

Answer 1 · 2021-01-22T18:16:24.000Z

Hi Kelvin - could you hold off on this please. In general, there can be a large memory cost when using parallelised apply functions that will need extensive checking, likewise for the additional overhead of multiple processes.

If you want to make a new branch and do some bench-marking then please feel free and report back here.

Answer 2 · 2021-01-22T18:24:18.000Z

I agree with @MikeDMorgan and I'll have a think about an efficient refactoring before going down the parallelization route, this piece of code looks like it could be slimmed down...

Answer 3 · 2021-01-22T18:59:38.000Z

Ah ok. yea the 40 min was too optimistic... slowed down considerably now. I'll have to rethink this.

Answer 4 · 2023-03-28T13:02:03.000Z

Hi all, bumping this as I am also finding that this step makes Milo horribly slow for with larger datasets... (I've been waiting for 30 minutes now with 7 of 8 cores sitting idle.)

Could there be a way to implement this without memory checking, but with a suitable warning?
e.g. allow a BPPARAM argument, that is only permitted to be parallel if you also set accept_all_memory_risk = TRUE? 😉
Or whatever is a sensible approach.

Thanks!
Will

Answer 5 · 2023-03-28T13:05:24.000Z

There is no longer any need to calculate distances - the recommendation is to use refinement_scheme="graph" in makeNhoods and add fdr.weighting="graph-only" to testNhoods. This uses an update graph-based approximation and dispenses with distance calculations. A manuscript is in development to describe this further, but the functionality is included in the current Bioc release (Bioc 3.16).

Answer 6 · 2023-04-17T14:11:20.000Z

Hey, trying this out and fdr.weighting = "graph-only" doesn't seem to be an option for testNhoods. Could this be a typo for graph-overlap?

Thanks

sessionInfo():

R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.6 (Ootpa)

Matrix products: default
BLAS/LAPACK: /apps/rocs/2020.08/cascadelake/software/OpenBLAS/0.3.9-GCC-9.3.0/lib/libopenblas_skylakexp-r0.3.9.so

locale:
 [1] LC_CTYPE=C                 LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] patchwork_1.1.2             scater_1.26.1
 [3] miloR_1.6.0                 fgsea_1.24.0
 [5] ggrepel_0.9.3               ggbeeswarm_0.7.1

Answer 7 · 2023-04-17T14:42:23.000Z

@wmacnair yes, sorry it should be "graph-overlap"

Answer 8 · 2023-04-17T14:47:47.000Z

Great, thanks for the quick confirmation :)