DavisVaughan/furrr

Add easy function for grouped df processing


Grouped df processing is an excellent opportunity for parallelization, but it can be cumbersome for a basic user to reason about and set up. However, the groupings make it possible to automate the splitting of work across workers and thus abstract the parallelization away from the user.

Here is a (non-optimized) function I wrote that makes it easy to take normal grouped df processing calls and parallelize them across a given number of workers.

parallelized_df_do <- function(.data, .f, ..., n_workers = NULL) {
  if (is.null(n_workers)) n_workers <- future::nbrOfWorkers()
  
  # Randomly assign each group key to one of the workers
  keys_tbl <- .data %>% 
    dplyr::group_keys() %>% 
    dplyr::mutate(.worker_id = sample(1:n_workers, replace = TRUE, size = nrow(.)))
  
  # Split into one chunk per worker, restore the original grouping
  # within each chunk, and process the chunks in parallel
  .data %>% 
    dplyr::ungroup() %>% 
    dplyr::left_join(keys_tbl, by = dplyr::group_vars(.data)) %>% 
    dplyr::group_split(.worker_id, .keep = FALSE) %>% 
    purrr::map(~ dplyr::group_by(.x, !!!dplyr::groups(.data))) %>% 
    furrr::future_map_dfr(.f, ...)
}

# An example
iris %>% 
  group_by(Petal.Length, Species) %>% 
  parallelized_df_do(~summarize(.x, across(starts_with("Sepal"), sum)))
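One thing worth noting: furrr only runs in parallel once a future plan has been set, so the example above should be preceded by something like this (assuming a local multisession backend; the worker count is just illustrative):

# The default plan is sequential, so without this the
# chunks are processed serially in the current session
future::plan(future::multisession, workers = 4)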

furrr is extremely focused on only exposing parallel versions of the purrr API, so I don't think this is a good fit for this package.

Have you seen multidplyr? It attempts to do some of this. https://github.com/tidyverse/multidplyr
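For reference, a rough multidplyr sketch of the example above (based on the new_cluster()/partition()/collect() API from its README, not on anything confirmed in this thread) might look like:

library(dplyr)
library(multidplyr)

cluster <- new_cluster(4)
cluster_library(cluster, "dplyr")  # make dplyr available on the workers

iris %>% 
  group_by(Petal.Length, Species) %>% 
  partition(cluster) %>%   # spread the groups across the workers
  summarize(across(starts_with("Sepal"), sum)) %>% 
  collect()                # gather the per-worker results back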

Thanks @DavisVaughan.
I hadn't heard of multidplyr, and I wasn't sure where this particular utility function should go. I agree that it's outside the traditional scope of furrr, and I see that some (if not all) of it has already been implemented in multidplyr.

Thanks for pointing me in that direction!