YosefLab/scone

POTENTIAL BUG: bplapply() and globals

Closed this issue · 5 comments

Just a comment after having talked about bplapply() with @drisso the other day;

Make sure to test the functions that call bplapply() with other backends than sequential and multicore too. For instance, test them with backends that run in separate background R processes, e.g. PSOCK clusters (parallel::makeCluster(2L)). Because, these require that all variables are properly exported to the worker (something you don't have to worry about when you use sequential and multicore workers).

Example:

> library("BiocParallel")
> a <- 42

> register(MulticoreParam(workers = 2))
> y <- bplapply(1:3, FUN = function(x) x + a)
> unlist(y)
[1] 43 44 45

> register(SnowParam(workers = 2, type = "SOCK"))
> y <- bplapply(1:3, FUN = function(x) x + a)
starting worker localhost:11794
starting worker localhost:11794
Error: BiocParallel errors
  element index: 1, 2, 3
  first error: object 'a' not found

Here's an example where I suspect you'll get an error because globals (zhat, muhat, coefs_pi) are not exported:

    fit_pi <- bplapply(seq_len(n), function(i) {
      fit <- suppressWarnings(glm(zhat[,i]~log(muhat), family = binomial(link = logit), start=coefs_pi[,i]))
      return(list(coefs=coefficients(fit), fitted=fitted.values(fit)))
    })

I have added (for now in develop) a set of tests running scone with many different back-ends, including BiocParallel::SnowParam(workers=2, type="SOCK").

All tests pass and I also ran the vignette with this back-end with no issue. The particular example that @HenrikBengtsson references is now gone (we decided to not pursue zinb() imputation), but there are still calls to bplapply that look suspicious, like:

  scaled <- bplapply(seq_len(NROW(sc_params)), function(i) scaling[[sc_params[i,2]]](imputed[[sc_params[i,1]]]), BPPARAM=bpparam)

However, this seems to work fine. Perhaps there is some automatic copy going on here? @HenrikBengtsson any thoughts?

Although I'm not a super savvy BiocParallel users, I don't think BiocParallel exports globals automatically, so I am a bit surprised that your most recent bplapply() snippet

scaled <- bplapply(seq_len(NROW(sc_params)), function(i) scaling[[sc_params[i,2]]](imputed[[sc_params[i,1]]]), BPPARAM=bpparam)

doesn't give an error reporting on object 'scaling' not found (or sc_params or imputed) when using an external R process as the backend such as SnowParam(workers=2, type="SOCK"). Are you 100% sure your package tests covers that piece of code?

@mtmorgan, do you have any comments?

BiocParallel is behaving like snow for these backends, so the function and the environment in which it is defined are copied. So

register(SnowParam(2))
fun0 = function(i) a[i]
p0 = function() {
    a = 1:2
    bplapply(1:2, fun0)    # fails
}

p1a = function() {
    fun1 = function(i) a[i]
    a = 1:2
    bplapply(1:2, fun1)    # works
}

p1b = function() {
    a=1:2
    bplapply(1:2, function(i) a[i])
}

p2 = function() {
    fun1 = function(i) a[i]
    a = 1:2
    x = integer(1e8)
    bplapply(1:2, fun1)    # works, expensive!
}

leads to

> bpstart(bpparam())
> p0()
Error: BiocParallel errors
  element index: 1, 2
  first error: object 'a' not found
> system.time(p1a())
   user  system elapsed 
  0.004   0.000   0.074 
> system.time(p2())
   user  system elapsed 
  1.164   0.096   1.583 
> bpstop(bpparam())

I'm pretty sure too that a function defined at the (top level) of package actually has the package namespace serialized to the workers, so all variables in the package are available.

Ah, that could be why it works, i.e. they probably have the p1b() case. Thxs Martin.

Thank you Martin!

Now I have a much better idea of what's going on. Which makes me worried about the large objects that are defined and hence exported. But that's a different issue, so I should close this one.