future_map is surprisingly slow

Question

Closed this issue 4 years ago · 24 comments

library(furrr)
#> Warning: package 'furrr' was built under R version 3.4.4
#> Loading required package: future
#> Warning: package 'future' was built under R version 3.4.4
library(purrr)
plan(multiprocess)

boot_df <- function(x) x[sample(nrow(x), replace = TRUE), ]
rsquared <- function(mod) summary(mod)$r.squared
boot_lm <- function(i) {
  rsquared(lm(mpg ~ wt + disp, data = boot_df(mtcars)))
}

system.time(map(1:500, boot_lm))
#>    user  system elapsed 
#>   0.470   0.006   0.477
system.time(future_map(1:500, boot_lm))
#>    user  system elapsed 
#>   0.716   0.197   0.914
system.time(parallel::mclapply(1:500, boot_lm, mc.cores = 4))
#>    user  system elapsed 
#>   0.893   0.612   0.214

What am I missing?

Answer 1 · 2018-08-21T11:19:59.000Z

I have also noticed that the first run of future_map takes longer than purrr::map.

However, on average it is indeed faster than purrr (hopefully illustrated by the code below). I guess there is a cost to initializing the background R sessions (I am running the code below on Windows).

R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

library(purrr)
library(furrr)
library(microbenchmark)
library(ggplot2)

plan(multiprocess)

boot_df <- function(x) x[sample(nrow(x), replace = TRUE), ]
rsquared <- function(mod) summary(mod)$r.squared
boot_lm <- function(i) {
  rsquared(lm(mpg ~ wt + disp, data = boot_df(mtcars)))
}

microbenchmark(map(1:500, boot_lm),
               future_map(1:500, boot_lm),
               times=100L) %>%
  autoplot()

Answer 2 · 2018-08-21T12:03:08.000Z

@vrontosc yes, on Windows future has to start up the R sessions initially, and then it keeps them around until the plan() changes. As a result, the initial call is slow but subsequent ones are fast. This is to be expected.
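
The one-time startup cost described here can be seen in isolation with a small sketch. It uses base R's parallel package rather than future (an assumption made so the example has no extra dependencies); a PSOCK cluster launches a comparable kind of background R session once and reuses it across calls. Timings depend on the machine, so none are shown.

```r
library(parallel)

# One-time cost: launching two background R sessions (a PSOCK cluster).
startup <- system.time(cl <- makeCluster(2))[["elapsed"]]

# Per-call cost once the sessions already exist.
first  <- system.time(res1 <- parLapply(cl, 1:100, sqrt))[["elapsed"]]
second <- system.time(res2 <- parLapply(cl, 1:100, sqrt))[["elapsed"]]

# Shut the background sessions down again.
stopCluster(cl)

print(c(startup = startup, first = first, second = second))
```

If the startup time dominates, a benchmark that includes the very first parallel call will mostly measure session creation rather than the mapped work.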

@hadley I'm going to look at this on my Mac tonight, but initial testing on a Windows machine is not showing as much of a slowdown as you're experiencing (median is a more relevant metric than mean because of the initial call that takes so long).

library(purrr)
#> Warning: package 'purrr' was built under R version 3.4.4
library(furrr)
#> Warning: package 'furrr' was built under R version 3.4.4
#> Loading required package: future
#> Warning: package 'future' was built under R version 3.4.4
library(microbenchmark)
#> Warning: package 'microbenchmark' was built under R version 3.4.4
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 3.4.4

plan(multiprocess)

boot_df <- function(x) x[sample(nrow(x), replace = TRUE), ]
rsquared <- function(mod) summary(mod)$r.squared
boot_lm <- function(i) {
  rsquared(lm(mpg ~ wt + disp, data = boot_df(mtcars)))
}

microbenchmark(
  map(1:500, boot_lm),
  future_map(1:500, boot_lm),
  times=50L
)
#> Unit: milliseconds
#>                        expr      min       lq     mean   median       uq
#>         map(1:500, boot_lm) 668.8876 724.5287 781.4994 761.7086 800.7949
#>  future_map(1:500, boot_lm) 325.9503 364.8292 467.7365 381.3079 431.0899
#>       max neval
#>  1126.069    50
#>  3917.181    50

Created on 2018-08-21 by the reprex package (v0.2.0).

Session info

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.1 (2017-06-30)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  tz       America/New_York            
#>  date     2018-08-21
#> Packages -----------------------------------------------------------------
#>  package        * version date       source                         
#>  assertthat       0.2.0   2017-04-11 CRAN (R 3.4.4)                 
#>  backports        1.1.2   2017-12-13 CRAN (R 3.4.3)                 
#>  base           * 3.4.1   2017-06-30 local                          
#>  bindr            0.1.1   2018-03-13 CRAN (R 3.4.4)                 
#>  bindrcpp         0.2.2   2018-03-29 CRAN (R 3.4.4)                 
#>  codetools        0.2-15  2016-10-05 CRAN (R 3.4.1)                 
#>  colorspace       1.3-2   2016-12-14 CRAN (R 3.4.4)                 
#>  compiler         3.4.1   2017-06-30 local                          
#>  datasets       * 3.4.1   2017-06-30 local                          
#>  devtools         1.13.5  2018-02-18 CRAN (R 3.4.3)                 
#>  digest           0.6.15  2018-01-28 CRAN (R 3.4.3)                 
#>  dplyr            0.7.6   2018-06-29 CRAN (R 3.4.4)                 
#>  evaluate         0.10.1  2017-06-24 CRAN (R 3.4.4)                 
#>  furrr          * 0.1.0   2018-05-16 CRAN (R 3.4.4)                 
#>  future         * 1.8.1   2018-05-03 CRAN (R 3.4.4)                 
#>  ggplot2        * 3.0.0   2018-07-03 CRAN (R 3.4.4)                 
#>  globals          0.12.1  2018-06-25 CRAN (R 3.4.4)                 
#>  glue             1.3.0   2018-07-31 Github (tidyverse/glue@a292148)
#>  graphics       * 3.4.1   2017-06-30 local                          
#>  grDevices      * 3.4.1   2017-06-30 local                          
#>  grid             3.4.1   2017-06-30 local                          
#>  gtable           0.2.0   2016-02-26 CRAN (R 3.4.4)                 
#>  htmltools        0.3.6   2017-04-28 CRAN (R 3.4.4)                 
#>  knitr            1.20    2018-02-20 CRAN (R 3.4.4)                 
#>  lazyeval         0.2.1   2017-10-29 CRAN (R 3.4.4)                 
#>  listenv          0.7.0   2018-01-21 CRAN (R 3.4.4)                 
#>  magrittr         1.5     2014-11-22 CRAN (R 3.4.4)                 
#>  memoise          1.1.0   2017-04-21 CRAN (R 3.4.4)                 
#>  methods        * 3.4.1   2017-06-30 local                          
#>  microbenchmark * 1.4-4   2018-01-24 CRAN (R 3.4.4)                 
#>  munsell          0.5.0   2018-06-12 CRAN (R 3.4.4)                 
#>  parallel         3.4.1   2017-06-30 local                          
#>  pillar           1.2.3   2018-05-25 CRAN (R 3.4.4)                 
#>  pkgconfig        2.0.1   2017-03-21 CRAN (R 3.4.4)                 
#>  plyr             1.8.4   2016-06-08 CRAN (R 3.4.4)                 
#>  purrr          * 0.2.5   2018-05-29 CRAN (R 3.4.4)                 
#>  R6               2.2.2   2017-06-17 CRAN (R 3.4.4)                 
#>  Rcpp             0.12.18 2018-07-23 CRAN (R 3.4.4)                 
#>  rlang            0.2.1   2018-05-30 CRAN (R 3.4.4)                 
#>  rmarkdown        1.10    2018-06-11 CRAN (R 3.4.4)                 
#>  rprojroot        1.3-2   2018-01-03 CRAN (R 3.4.4)                 
#>  scales           0.5.0   2017-08-24 CRAN (R 3.4.4)                 
#>  stats          * 3.4.1   2017-06-30 local                          
#>  stringi          1.1.7   2018-03-12 CRAN (R 3.4.4)                 
#>  stringr          1.3.1   2018-05-10 CRAN (R 3.4.4)                 
#>  tibble           1.4.2   2018-01-22 CRAN (R 3.4.4)                 
#>  tidyselect       0.2.4   2018-02-26 CRAN (R 3.4.4)                 
#>  tools            3.4.1   2017-06-30 local                          
#>  utils          * 3.4.1   2017-06-30 local                          
#>  withr            2.1.2   2018-03-15 CRAN (R 3.4.4)                 
#>  yaml             2.1.19  2018-05-01 CRAN (R 3.4.4)

Answer 3 · 2018-08-21T13:03:32.000Z

I think it would be better to start those sessions in the background when you attach the package

Answer 4 · 2018-08-21T13:20:01.000Z

I should say that I don't think that this change improves the day-to-day usage of furrr by that much, but it makes it much easier to quickly sell because you can immediately show a benchmark without having to first explain the setup. You'd still want to eventually explain how furrr works, but you can hold off on that explanation until you've shown the reader why they should care in the first place.

I don't think systematic benchmarks are what you should be comparing here; it's the time on first run that people will look at first. (And spending an extra 500ms during package attach isn't going to be noticeable)

Answer 5 · 2018-08-22T00:52:55.000Z

This is something I've considered, but I'm a bit mixed on it.

Pros:

it makes it much easier to quickly sell

I can really get behind this, since you're right that you really don't have to explain anything to the user about how it works until their interest is piqued by the instantly faster code.

Cons:

1. It is very much against the philosophy of future for the developer to ever set the plan() for the user. Henrik's words: "With futures, it is easy to write R code once, which later the user can choose to parallelize using whatever resources s/he has available".
2. I don't want to override any plan() the user has already set ahead of time. If I were to do this, it would be conditional: it would only set plan(multiprocess) if the current plan was sequential. This is particularly important if, before attaching furrr, the user has already set up a plan that uses a remote cluster, such as plan(cluster, workers = aws_ec2_cluster).
3. Consider the case of a developer programming with furrr. They could include a call to future_map() inside a function, say a feature engineering function. Then they'd be able to advertise that their function works sequentially by default, but also easily in parallel! If I made the default plan(multiprocess) when the package is loaded, I think this would remove the ability for their package to default to running it sequentially. I like the idea of the user "opting in" to parallelism. Although, if I included that plan(multiprocess) call in .onAttach() rather than .onLoad(), would it ever be called when a developer called a furrr function with ::? The "Loading vs Attaching" discussion in R Packages suggests it would not be called. I'm not sure if this would be more confusing, or less so.
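
The conditional default described in the second point could look roughly like the sketch below. It is not furrr's actual code. The helper name is_default_plan is hypothetical, the check assumes that the object returned by future::plan() carries "sequential" in its class vector when no plan has been set, and multisession stands in for the thread's multiprocess, which later releases of future deprecated. The mock objects exist only so the check can be tried without future installed.

```r
# Hypothetical helper: TRUE only when the plan still looks like the default.
# `current_plan` is whatever future::plan() returns (assumed to have class
# "sequential" when the user has not chosen a plan).
is_default_plan <- function(current_plan) {
  inherits(current_plan, "sequential")
}

# Sketch of an attach hook. It runs on library(pkg), not on pkg::fun(),
# so code calling the package with :: would keep the sequential default.
.onAttach <- function(libname, pkgname) {
  if (is_default_plan(future::plan())) {
    # multisession is an assumption standing in for multiprocess
    future::plan(future::multisession)
  }
}

# Mock plan objects for illustration; real ones come from future::plan().
mock_sequential <- structure(function(...) NULL,
                             class = c("sequential", "future", "function"))
mock_cluster    <- structure(function(...) NULL,
                             class = c("cluster", "future", "function"))

is_default_plan(mock_sequential)  # the hook would set a parallel plan
is_default_plan(mock_cluster)     # the hook would leave the user's plan alone
```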

Side note) I think that on Mac, with multisession, future starts up the processes at the plan(multisession) call (which is good). But on Windows, it starts the processes at the first call to future() (which we don't like). I'll need to test a bit more and talk to Henrik about that.

Answer 6 · 2018-08-22T15:17:56.000Z

Starting up the processes at plan() time would alleviate a lot of my concerns. (Although you could still set an automatic plan on startup assuming one hadn't already been defined and get most of the advantages. I don't understand why you would call future_map() instead of map() if you wanted the code to be executed serially.)

But frankly, I find your point 3 quite scary - I would be very worried that a global setting would change how my function works in such a fundamental way. Generally, I think automatic parallelism is a pipe dream - in order to get good performance, you have to think carefully about how data will be distributed across the nodes. Additionally, given the heuristics that you have to use to figure out how to serialise the environment of f(), it's quite likely that parallelism will cause future_map() to fail in some scenarios where it would succeed if run sequentially.

Answer 7 · 2018-08-22T16:10:38.000Z

Starting up the processes at plan() time would alleviate a lot of my concerns.

I've figured out the nuances of why this wasn't working right and am asking Henrik. It's a Windows only thing. He said he'd look at it.

I don't understand why you would call future_map() instead of map() if you wanted the code to be executed serially

I would argue the reasoning for this as a package developer thinking of other package developers is that we can avoid the need for, say, .parallel = FALSE in plyr::llply(). If I can make it so that a package developer can use:

my_fun <- function(x) {
  furrr::future_map(x, run_me)
}

# runs sequentially
my_fun(x)

# runs in parallel
plan(multiprocess)
my_fun(x)

rather than having to do (maybe a simpler version of this is possible but the if statement is my main point):

my_fun <- function(x, .parallel = FALSE) {
  if(.parallel) {
    furrr::future_map(x, run_me)
  } else {
    purrr::map(x, run_me)
  }
}

# runs sequentially
my_fun(x)

# runs in parallel, I guess with multiprocess as default
# inherited from furrr if we follow this train of thought?
my_fun(x, .parallel = TRUE)

# can override with new plan
plan(cluster, workers = blah)
my_fun(x, .parallel = TRUE)

then I think that is a worthwhile reason to use future_map() to encapsulate all cases.

I would be very worried that a global setting would change how my function works in such a fundamental way

I very much agree, which is why I think Henrik is all for never touching the plan as a developer, and just letting the user specify it.

given the heuristics that you have to use to figure out how to serialise the environment of f()

Working around not being able to serialize your rlang::~ has been quite fun ;)

it's quite likely that parallelism will cause future_map() to fail in some scenarios where it would succeed if run sequentially

Henrik has a whole vignette on all of the fun ways it can fail. https://cran.r-project.org/web/packages/future/vignettes/future-4-issues.html

Edit) Henrik has fixed the plan(multiprocess) on Windows issue. With any multisession-like future, the sessions are now started at the call to plan()

Answer 8 · 2018-08-22T22:19:32.000Z

With the current design of furrr, as a package author, I would never use future_map() because I can't predict what it will do. It's fundamentally "implementation"-unstable in a way that makes me very nervous.

(Also, the Windows problem doesn't explain my situation since I'm on a Mac.)

Answer 9 · 2018-08-22T22:31:19.000Z

Is this because of the "auto" global lookup feature having the possibility of missing some required globals? Or maybe also because of the potential that some globals just can't be serialized correctly? I would love to hear more about your thoughts here.

I'm still investigating the Mac slowness. On a Mac, with multiprocess (which chooses multicore), the forked processes are always started up when future_map() is called, and then are immediately shut down. This is fundamentally different than what multisession does, starting up the sessions when plan() is called and keeping them around, but clearing them out after every future_map() call so nothing in the environment persists from one future_map() call to the next. I think this forked process startup is pretty fast, but I'm still testing.

Besides that, I think that there is some slowness with searching for globals (especially on the first call for some reason) that might be able to be improved.

Answer 10 · 2018-08-22T22:38:16.000Z

Interesting results here on my Mac.

library(furrr)
#> Loading required package: future
library(purrr)
plan(multiprocess)

boot_df <- function(x) x[sample(nrow(x), replace = TRUE), ]
rsquared <- function(mod) summary(mod)$r.squared
boot_lm <- function(i) {
  rsquared(lm(mpg ~ wt + disp, data = boot_df(mtcars)))
}

system.time(map(1:500, boot_lm))
#>    user  system elapsed 
#>   0.492   0.007   0.508
system.time(future_map(1:500, boot_lm))
#>    user  system elapsed 
#>   0.810   0.145   1.088
system.time(future_map(1:500, boot_lm))
#>    user  system elapsed 
#>   1.301   0.322   0.591

Created on 2018-08-22 by the reprex package (v0.2.0).

Answer 11 · 2018-08-22T23:03:02.000Z

I think switching between in-process and out-of-process execution with a global setting is a fundamentally bad idea. There are two main reasons:

Given the semantics of R, I don't think that you can guarantee that in-process and out-of-process execution of a function will return the same result. This means that a global variable (outside the control of your function) may alter results.
For the majority of problems, efficient multiprocess computation requires thinking about what data should be transferred between parent and children processes, how long it should live there, and when it should be returned. If I have to think about that, there's no advantage to supporting in-process computation as well.
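
The first reason can be made concrete with a small sketch. It does not use future or furrr: a worker process is simulated by a serialize/unserialize round trip, which is an assumption about how a function reaches a separate R session, and every element is treated as if it went to its own fresh worker. The names make_counter and worker_copy are hypothetical.

```r
# A function with hidden state: each call increments a counter that lives
# in the function's enclosing environment.
make_counter <- function() {
  count <- 0
  function(x) {
    count <<- count + 1
    count
  }
}

# In-process: all calls share one environment, so the state accumulates.
counter_a <- make_counter()
in_process <- vapply(1:3, counter_a, numeric(1))

# Simulated out-of-process: each call runs on a serialized copy of the
# function, so it sees a private copy of `count` that starts at 0.
counter_b <- make_counter()
out_of_process <- vapply(1:3, function(x) {
  worker_copy <- unserialize(serialize(counter_b, NULL))
  worker_copy(x)
}, numeric(1))

print(in_process)      # 1 2 3
print(out_of_process)  # 1 1 1
```

A real backend typically sends chunks of elements to each worker, so the out-of-process values would depend on how the input is chunked, which is one more way the result can vary with a setting the function does not control.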

Finally, purrr already provides a way to specify sequential in-process computation, and providing synonyms should be done with care.

There are very few situations where controlling behaviour with a global variable is a good idea. I think it's ok to use global variables to control how something is printed, but a global variable should never control how something is computed. Given the well known problems with global variables, I think you need a compelling reason to use them, and I don't see one here.

Answer 12 · 2018-08-22T23:04:44.000Z

Just to be clear, the "global setting" you are talking about is the plan()?

I'm a bit confused as I don't ever set a global variable with furrr.

Answer 13 · 2018-08-22T23:38:07.000Z

Correct. plan() is a global setting which is equivalent to a global variable.

Answer 14 · 2018-08-23T00:12:29.000Z

I think switching between in-process and out-of-process execution with a global setting is a fundamentally bad idea.

I'm curious if you've given much thought to alternatives then. The "write once, run anywhere" nature of the future framework really thrives off this global setting idea. It's somewhat similar to the doParallel, doWhatever approach of the foreach package, where the backend is specified by the user through the "global setting" of registerDoParallel() and friends. plyr uses this; are you implying that's a bad idea? (It's fine if you think it is, I'm just exploring options.) The only difference is that you have a switch that turns the use of that backend on or off with .parallel = FALSE/TRUE, but the backend when .parallel = TRUE can still be anything the user decides.

Does this mean you'd be more comfortable if furrr had that .parallel flag and the user could still set the parallel backend with plan()? But .parallel=FALSE would just be purrr so that probably wouldn't make much sense... Looking at it through this lens, your argument of "why would I ever use future_map() sequentially?" makes a bit more sense.

Are other alternatives just more limited in scope where you've ruthlessly ensured that, say, running in parallel locally using multiple R sessions will always return the same thing as running sequentially, and that is the ONLY thing you are allowed to do? Like you said, that would be a difficult thing to ensure.

I don't see the approach of using plan() being changed in the future package, and since furrr is somewhat built on top of that, it inherits both the flexibility and issues that come with it. I'm sorry if that is not the answer you are looking for :/

Answer 15 · 2018-08-23T00:27:54.000Z

In other news, I've partially discovered why future_map() is slow, especially the first (and second) times around. A profvis() showed A LOT of compiler::cmpfun() calls on the first call. I've seen future construct large closures along the way to create the expression that gets passed on to the workers, and it seems like this might be part of it.

library(furrr)
#> Loading required package: future
library(purrr)
plan(multiprocess)

boot_df <- function(x) x[sample(nrow(x), replace = TRUE), ]
rsquared <- function(mod) summary(mod)$r.squared
boot_lm <- function(i) {
  rsquared(lm(mpg ~ wt + disp, data = boot_df(mtcars)))
}

compiler::enableJIT(0)
#> [1] 3
system.time(future_map(1:500, boot_lm))
#>    user  system elapsed 
#>   0.069   0.036   0.564

compiler::enableJIT(3)
#> [1] 0

# compile large closures
system.time(future_map(1:500, boot_lm))
#>    user  system elapsed 
#>   1.723   0.340   0.856

# compile smaller closures now that we are on the second pass
system.time(future_map(1:500, boot_lm))
#>    user  system elapsed 
#>   0.865   0.218   0.607

# no extra compiling
system.time(future_map(1:500, boot_lm))
#>    user  system elapsed 
#>   0.997   0.269   0.471

system.time(future_map(1:500, boot_lm))
#>    user  system elapsed 
#>   1.011   0.259   0.479

system.time(future_map(1:500, boot_lm))
#>    user  system elapsed 
#>   1.054   0.262   0.486

Created on 2018-08-22 by the reprex package (v0.2.0).

Edit) Enabling the ByteCompile field in the DESCRIPTION file of future seems to help a good bit with this.

Answer 16 · 2018-08-23T14:40:06.000Z

I think the solution is simple: use purrr for in-process, and furrr for out-of-process. furrr should error out when the plan is sequential to make this distinction clear.

Additionally, what if instead of relying on future::plan(), furrr used its own default evaluator?

Answer 17 · 2018-08-24T01:27:05.000Z

purrr for in-process, and furrr for out-of-process

I guess my instinct up until this conversation has been to create something that is completely future compliant that happens to implement parallel purrr. Flipping this, and making something that is tidyverse friendly first, that happens to build on the future framework as the backend, makes me feel alright with this restriction to not allow plan(sequential). furrr could error with a note to just use purrr in the sequential case.

I'm trying to fully understand what you mean by
a) not relying on future::plan()
b) having a default evaluator

For a), do you mean not letting the user specify the plan() at all? Or just that a default is set to start up furrr in parallel like we talked about earlier? That way the user would not immediately run into the sequential error. If we are restricting furrr to out-of-process only, I would be alright with the "default to parallel" idea, where the user could still manually set the plan to use a cluster or something else.

For b), I just want to clarify what you mean by "evaluator". In the future world, the functions multiprocess, multicore, cluster, etc are called "evaluators", and I wasn't sure if you actually meant that furrr would implement its own multifurrr evaluator. If so, what would it do? If not, are you just again implying that furrr should implicitly default the plan to plan(multiprocess)? Or something else entirely?

Answer 18 · 2018-08-24T11:40:51.000Z

Instead of relying on the global evaluator set by plan(), you could maintain your own internal default, passing an explicit evaluator to every future call. That way furrr could always use a parallel evaluator, but wouldn't interfere with global user preference.
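
As a rough illustration of that design, the sketch below keeps a default evaluator in package-internal state and passes it explicitly on every call. Nothing here is furrr's or future's API: .pkg_state, my_future_map and the .evaluator argument are hypothetical names, and the "evaluators" are plain functions of the form function(x, f), with lapply() standing in for a parallel backend.

```r
# Hypothetical package-internal state; not visible to or shared with users.
.pkg_state <- new.env(parent = emptyenv())

# Stand-in default evaluator: any function(x, f) that returns a list.
.pkg_state$evaluator <- function(x, f) lapply(x, f)

# The mapper never consults a global plan. The evaluator is an explicit
# argument whose default comes from the package's own state.
my_future_map <- function(.x, .f, .evaluator = .pkg_state$evaluator) {
  .evaluator(.x, .f)
}

# Uses the internal default.
res_default <- my_future_map(1:3, function(i) i^2)

# Per-call override; the internal default is left untouched.
# mc.cores = 1 keeps this portable, since forking is unavailable on Windows.
forked <- function(x, f) parallel::mclapply(x, f, mc.cores = 1L)
res_override <- my_future_map(1:3, function(i) i^2, .evaluator = forked)

print(identical(res_default, res_override))
```

Because the choice of evaluator travels with each call, a caller can see from the call site how the work will be executed, which is the property the global plan() does not give.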

Answer 19 · 2018-08-24T11:50:10.000Z

Oh so use the evaluator argument of future() and have an internal default of multiprocess that gets passed to each future call. And the user could modify with an option to future_options(evaluator=...) if they wanted to. I'd have to think about how multilevel futures would work, but I think this makes sense.
I think for that it would just be:

future_map(
   .x = x, 
   .f = ~future_map(
      .x = .x, 
      .f = .f, 
      .options = future_options(evaluator2)
   ), 
   .options = future_options(evaluator1)
)

Edit) I do kind of like how this removes the use of the global plan(). Tbh I kind of forgot that future() had the evaluator argument which was part of my confusion.

Edit2) Since no plan() would be set up before the first future_map() call, the first call would be slow because it would have to set up the processes. Alternatively, since the default would likely be multiprocess, a dummy init call could be made in .onLoad(). This is essentially what plan() does anyways.

Answer 20 · 2019-04-09T17:46:15.000Z

Hi,
Not sure if this is related, so I'm posting it here to start.

Please let me know if a new issue should be raised.

Thanks in advance!

model variable contains trained models from the caret package and the list is 2863 long.
data is a fixed data set for predict()

f2 <- function(model) data.frame(t = seq(7), pred = as.numeric(predict(model, data)))

# parallel    
plan(multiprocess(workers = 40))
tic()
furrr::future_map(as.list(model), ~f2(.x))
toc()
> 350.56 sec elapsed

# serial
plan(sequential)
tic()
furrr::future_map(as.list(model), ~f2(.x))
toc()
> 114.18 sec elapsed

Answer 21 · 2019-04-18T17:59:22.000Z

@hadley @DavisVaughan I am reading this and I am not sure I understand everything. What I am worried about is this: will furrr return the same results (just much faster) as the computation I would have run sequentially? Or can something weird happen along the way?

Answer 22 · 2020-08-03T21:19:09.000Z

plan() does now spin up processes on Windows, so that first call to future_map() is a bit faster there now.
https://github.com/HenrikBengtsson/future/blob/0df330211b5456f977963bfa288844650cca262c/NEWS#L588

I don't think there is much else to do here

Answer 23 · 2021-05-31T15:54:04.000Z

I found an interesting effect.
Counterintuitively, decreasing the number of workers improves performance.
See the decrease of workers from 16 to 8 to 4 in the following.

plan(multicore, workers = 16)

plan(multicore, workers = 8)

plan(multicore, workers = 4)

Answer 24 · 2021-06-01T11:42:55.000Z

With operations that have a total time in the millisecond region, this isn't unexpected. It takes a non-trivial amount of time to send data off to workers, retrieve data from workers, and post-process it, and that's probably what you are seeing here.
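
A minimal sketch of that effect, under a simplifying assumption: the worker round trip is imitated by serializing the input and the result within a single R session, so it captures only the serialization part of the overhead and none of the inter-process communication. The names tiny_task and round_trip are hypothetical, and timings are machine-dependent, so none are shown.

```r
# Tiny task: the work per element is far smaller than the cost of shipping it.
tiny_task <- function(x) x + 1

# Hypothetical stand-in for a worker round trip: serialize the input,
# run the task on the copy, then serialize the result back.
round_trip <- function(x, f) {
  input  <- unserialize(serialize(x, NULL))
  output <- f(input)
  unserialize(serialize(output, NULL))
}

xs <- as.list(1:10000)

direct  <- system.time(res_direct  <- lapply(xs, tiny_task))[["elapsed"]]
shipped <- system.time(res_shipped <- lapply(xs, round_trip, f = tiny_task))[["elapsed"]]

print(c(direct = direct, shipped = shipped))
```

Both versions return the same values; only the bookkeeping around each element differs. When that bookkeeping costs more than the task itself, adding workers adds more of it, which is consistent with fewer workers performing better on very short jobs.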