DavisVaughan/furrr

`furrr` much slower than `purrr` on nested data

Closed this issue · 5 comments

Hello,

I would like to use the furrr package to run a row-wise analysis of my data. I find that furrr is slower than purrr, which was also reported by @hadley in #41.

Here is a reprex:

require(dplyr); require(furrr); require(purrr); require(tidyr)
future::plan(multisession)

# Create a large dataset: 50,000 copies of mtcars, each tagged with its own ID
Data <- lapply(1:50000, function(x) {
  mtcars0 <- sample(mtcars)  # sample() on a data frame permutes its columns
  mtcars0$ID <- x
  mtcars0                    # return the tagged copy
}) %>% do.call(what = rbind) %>% as_tibble()

Here I apply a simple function to each row of the data. There is a time difference, but it is not huge.

SimpleFun <- function(x){ x*sample(1:100,1) }
tictoc::tic()
Data %>% mutate(disp2 = map_dbl(disp, SimpleFun))
tictoc::toc()
# 2.004 sec elapsed

tictoc::tic()
Data %>% mutate(disp2 = future_map_dbl(disp, SimpleFun, .progress = TRUE))
tictoc::toc()
# 21.126 sec elapsed

However, when I apply another simple function to a nested dataset, it works fine with purrr but takes ages to run with furrr (if it finishes at all), which is odd.

SimpleFun2 <- function(x){sample(x)}

tictoc::tic()
Data %>% group_by(ID) %>% nest() %>% 
  mutate(data2 = map(data, SimpleFun2))
tictoc::toc()
# 5.011 sec elapsed

tictoc::tic()
Data %>% group_by(ID) %>% nest() %>% 
  mutate(data2 = future_map(data, SimpleFun2))
tictoc::toc()
# did not work

I tested this on two different R installations: Windows with R 4.2.0 and furrr 0.3.0, and RStudio Server with R 3.6 and furrr 0.2.3.

Is there a reason for this? Any advice on making parallel analysis of nested datasets faster?

Cheers,
Ahmed

For the first one, it is the progress bar that is taking so long. The progress bar should mainly be used for things where each individual iteration takes a relatively large amount of time, otherwise the overhead of the progress bar outweighs its usefulness.

Also note that the progress bar is deprecated, and should not really be used anymore. I will eventually remove it in favor of the progressr package.
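As a rough sketch of what the progressr route can look like: the block below wraps a `future_map_dbl()` call in `progressr::with_progress()` and signals one step per element through a `progressor()`. The function `slow_square()`, the 0.5 second sleep, and the choice of 2 workers are illustrative assumptions, not part of the original report. The sleep is there only so that each iteration is slow enough for progress reporting to be worth its overhead.

```r
library(furrr)
library(progressr)

future::plan(future::multisession, workers = 2)

# Hypothetical slow function: each call is expensive enough that
# reporting progress costs little relative to the work itself.
slow_square <- function(x) {
  Sys.sleep(0.5)
  x^2
}

xs <- 1:6

out <- with_progress({
  # One progress step per element of xs
  p <- progressor(steps = length(xs))
  future_map_dbl(xs, function(x) {
    result <- slow_square(x)
    p()  # signal that one step has finished
    result
  })
})

out

# Return to sequential evaluation and shut down the workers
future::plan(future::sequential)
```

For very cheap functions such as `SimpleFun`, signalling progress on every element would likely reintroduce the same kind of overhead, so this pattern is best kept for slow iterations.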

I'm not surprised that furrr is slower here. When the total time is < 5 seconds or so, I expect map() to basically beat future_map() every time.

require(dplyr); require(furrr); require(purrr); require(tidyr)
future::plan(multisession, workers = 3)

# Create some large dataset
Data <- as_tibble(mtcars)
Data <- vctrs::vec_rep(Data, 50000)
Data$ID <- vctrs::vec_rep_each(1:50000, nrow(mtcars))

Data
#> # A tibble: 1,600,000 × 12
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb    ID
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4     1
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     1
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2     1
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     1
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2     1
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     1
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     1
#> # … with 1,599,990 more rows

tictoc::tic()
xx <- Data %>% mutate(disp2 = map_dbl(disp, identity))
tictoc::toc()
#> 0.943 sec elapsed

tictoc::tic()
xx <- Data %>% mutate(disp2 = future_map_dbl(disp, identity))
tictoc::toc()
#> 1.538 sec elapsed

Created on 2022-05-12 by the reprex package (v2.0.1)

I'll address the second question in a moment...

For the second question, you just forgot to ungroup() after the nest(). If you give nest() a grouped data frame, it remains grouped after the nesting (for better or worse). This prevents future_map() from doing what it is good at: partitioning the data over the workers. Because there are 50,000 groups, mutate() calls future_map() 50,000 times, once per group, each time on a single-element list. The same grouping makes map() run slower too.

It is exactly the problem outlined in the Common Gotchas vignette.
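A quick way to check for this before mapping is to ask dplyr whether the nested data frame is still grouped. The small sketch below uses `mtcars` grouped by `cyl` purely as an illustration; the object names `nested_grouped` and `nested_flat` are made up for the example.

```r
library(dplyr)
library(tidyr)

# nest() on a grouped data frame keeps the grouping
nested_grouped <- mtcars %>%
  group_by(cyl) %>%
  nest()

is_grouped_df(nested_grouped)  # TRUE
n_groups(nested_grouped)       # 3, so a mutate() would run its expression 3 times

# ungroup() leaves one flat data frame with a list-column
nested_flat <- ungroup(nested_grouped)

is_grouped_df(nested_flat)     # FALSE
```

With the flat version, a `mutate()` evaluates its expression once over the whole list-column, which is what lets `future_map()` split the elements across workers.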

require(dplyr); require(furrr); require(purrr); require(tidyr)
future::plan(multisession, workers = 3)

# Create some large dataset
Data <- as_tibble(mtcars)
Data <- vctrs::vec_rep(Data, 50000)
Data$ID <- vctrs::vec_rep_each(1:50000, nrow(mtcars))

NestedData <- Data %>% 
  group_by(ID) %>% 
  nest() %>% 
  ungroup()

tictoc::tic()
xx <- mutate(NestedData, data2 = map(data, identity))
tictoc::toc()
#> 0.105 sec elapsed

tictoc::tic()
xx <- mutate(NestedData, data2 = future_map(data, identity))
tictoc::toc()
#> 9.069 sec elapsed

This is an acceptable overhead to me, because it has to shuffle the nested data frames to and from the workers.

On my computer, furrr is slower than purrr when running the same code. I don't know why.

Test

Look closer at #234 (comment)

I'm already showing an example where furrr is slower. That's perfectly normal when you are sending over large datasets to each worker and then running an extremely cheap function on each one of them.

When doing parallel work, there can be large costs to sending "big" datasets over to the workers, which is not something that sequential evaluation has to do.
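To get a feel for that cost, one rough approach is to serialize the data in the current session and compare that with the time spent on the actual work. The sketch below uses base R only and does not start any workers. It assumes that a serialize/unserialize round trip is a reasonable stand-in for shipping data to a multisession worker, and the list `big` of 2,000 copies of `mtcars` is an arbitrary illustrative size.

```r
# A list of many small data frames, similar in shape to a nested list-column
big <- replicate(2000, mtcars, simplify = FALSE)

# Serialize to a raw vector, as a stand-in for the bytes sent to a worker
payload <- serialize(big, connection = NULL)
payload_mb <- length(payload) / 1024^2
payload_mb

# Unserialize, as a stand-in for the worker rebuilding the objects
roundtrip <- unserialize(payload)

# Time the cheap "work" versus one serialization round trip
t_work <- system.time(lapply(big, identity))[["elapsed"]]
t_ship <- system.time(unserialize(serialize(big, connection = NULL)))[["elapsed"]]

c(work = t_work, ship = t_ship)
```

If `ship` is much larger than `work`, parallelizing that particular step is unlikely to pay off. Real workers add further overhead that this sketch does not capture, since results also have to be sent back.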

Also, in the future, we'd prefer if you open new issues rather than commenting on old ones. It is easier for us to keep track of!