/tidyfast

Fast and efficient alternatives to tidyr functions built on data.table #rdatatable #rstats

Primary LanguageR

tidyfast v0.2.0

Lifecycle: maturing Travis build status Codecov test coverage

The goal of tidyfast is to provide fast and efficient alternatives to some tidyr (and a few dplyr) functions using data.table under the hood. Each have the prefix of dt_ to allow for autocomplete in IDEs such as RStudio. These should compliment some of the current functionality in dtplyr (but notably does not use the lazy_dt() framework of dtplyr). This package imports data.table and Rcpp (no other dependencies).

These are, in essence, translations from a more tidyverse grammar to data.table. Most functions herein are in places where, in my opinion, the data.table syntax is not obvious or clear. As such, these functions can translate a simple function call into the fast, efficient, and concise syntax of data.table.

The current functions include:

Nesting and unnesting (similar to dplyr::group_nest() and tidyr::unnest()):

  • dt_nest() for nesting data tables
  • dt_unnest() for unnesting data tables
  • dt_hoist() for unnesting vectors in a list-column in a data table

Pivoting (similar to tidyr::pivot_longer() and tidyr::pivot_wider())

  • dt_pivot_longer() for fast pivoting using data.table::melt()
  • dt_pivot_wider() for fast pivoting using data.table::dcast()

If Else (similar to dplyr::case_when()):

  • dt_case_when() for dplyr::case_when() syntax with the speed of data.table::fifelse()

Fill (similar to tidyr::fill())

  • dt_fill() for filling NA values with values before it, after it, or both. This can be done by a grouping variable (e.g. fill in NA values with values within an individual).

Count and Uncount (similar to tidyr::uncount() and dplyr::count())

  • dt_count() for fast counting by group(s)
  • dt_uncount() for creating full data from a count table

Separate (similar to tidyr::separate())

  • dt_separate() for splitting a single column into multiple based on a match within the column (e.g., column with values like “A.B” could be split into two columns by using the period as the separator where column 1 would have “A” and 2 would have “B”). It is built on data.table::tstrsplit(). This is not well tested yet and lacks some functionality of tidyr::separate().

Adjust data.table print options

  • dt_print_options() for adjusting the options for print.data.table()

General API

tidyfast attempts to convert syntax from tidyr with its accompanying grammar to data.table function calls. As such, we have tried to maintain the tidyr syntax as closely as possible without hurting speed and efficiency. Some more advanced use cases in tidyr may not translate yet. We try to be transparent about the shortcomings in syntax and behavior where known.

Each function that takes data (labeled as dt_ in the package docs) as its first argument automatically coerces it to a data table with as.data.table() if it isn’t already a data table. Each of these functions will return a data table.

Installation

You can install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("TysonStanley/tidyfast")

Examples

The initial versions of the nesting and unnesting functions were shown in a preprint. Herein is shown some simple applications and the functions’ speed/efficiency.

Nesting and Unnesting

The following data table will be used for the nesting/unnesting examples.

set.seed(84322)

library(tidyfast)
library(data.table)
library(dplyr)       # to compare with case_when()
library(tidyr)       # to compare with fill() and separate()
library(ggplot2)     # figures
library(ggbeeswarm)  # figures

dt <- data.table(
   x = rnorm(1e5),
   y = runif(1e5),
   grp = sample(1L:5L, 1e5, replace = TRUE),
   nested1 = lapply(1:10, sample, 10, replace = TRUE),
   nested2 = lapply(c("thing1", "thing2"), sample, 10, replace = TRUE),
   id = 1:1e5)

To make all the comparisons herein more equal, we will set the number of threads that data.table will use to 1.

setDTthreads(1)

We can nest this data using dt_nest():

nested <- dt_nest(dt, grp)
nested
#>    grp         data
#> 1:   1 <data.table>
#> 2:   2 <data.table>
#> 3:   3 <data.table>
#> 4:   4 <data.table>
#> 5:   5 <data.table>

We can also unnest this with dt_unnest():

dt_unnest(nested, col = data)
#>         grp          x           y               nested1
#>      1:   1 -1.1813164 0.004599736       2,2,1,2,1,1,...
#>      2:   1 -1.0384420 0.853208540       2,8,4,6,7,7,...
#>      3:   1 -0.6247028 0.072652533       4,2,2,1,1,1,...
#>      4:   1 -1.3651514 0.569079215       1,1,1,3,6,2,...
#>      5:   1  0.1403744 0.864617284 10, 1, 1, 1, 8, 1,...
#>     ---                                                 
#>  99996:   5 -0.3437795 0.995197776       2,1,2,2,2,1,...
#>  99997:   5  1.6157744 0.241735719 10, 1, 1, 1, 8, 1,...
#>  99998:   5 -0.1321246 0.885283934       2,3,3,2,2,4,...
#>  99999:   5 -1.7019715 0.524621296       5,4,3,3,3,2,...
#> 100000:   5  0.3821493 0.032851280       2,8,4,6,7,7,...
#>                                               nested2    id
#>      1: thing2,thing2,thing2,thing2,thing2,thing2,...     2
#>      2: thing2,thing2,thing2,thing2,thing2,thing2,...     8
#>      3: thing1,thing1,thing1,thing1,thing1,thing1,...    15
#>      4: thing1,thing1,thing1,thing1,thing1,thing1,...    17
#>      5: thing2,thing2,thing2,thing2,thing2,thing2,...    20
#>     ---                                                    
#>  99996: thing1,thing1,thing1,thing1,thing1,thing1,... 99983
#>  99997: thing2,thing2,thing2,thing2,thing2,thing2,... 99990
#>  99998: thing2,thing2,thing2,thing2,thing2,thing2,... 99994
#>  99999: thing2,thing2,thing2,thing2,thing2,thing2,... 99996
#> 100000: thing2,thing2,thing2,thing2,thing2,thing2,... 99998

When our list columns don’t have data tables (as output from dt_nest()) we can use the dt_hoist() function, that will unnest vectors. It keeps all the other variables that are not list-columns as well.

dt_hoist(dt, nested1, nested2)
#> The following columns were dropped because they are list-columns (but not being hoisted):
#>                  x         y grp     id nested1 nested2
#>       1: 0.1720703 0.3376675   2      1       1  thing1
#>       2: 0.1720703 0.3376675   2      1       1  thing1
#>       3: 0.1720703 0.3376675   2      1       1  thing1
#>       4: 0.1720703 0.3376675   2      1       1  thing1
#>       5: 0.1720703 0.3376675   2      1       1  thing1
#>      ---                                               
#>  999996: 0.6268181 0.7851774   1 100000       1  thing2
#>  999997: 0.6268181 0.7851774   1 100000       5  thing2
#>  999998: 0.6268181 0.7851774   1 100000       7  thing2
#>  999999: 0.6268181 0.7851774   1 100000       6  thing2
#> 1000000: 0.6268181 0.7851774   1 100000       7  thing2

Speed comparisons (similar to those shown in the preprint) are highlighted below. Notably, the timings are without the nested1 and nested2 columns of the original dt object from above. Also, all dplyr and tidyr functions use a tbl version of the dt table.

#> # A tibble: 2 x 3
#>   expression   median mem_alloc
#>   <chr>      <bch:tm> <bch:byt>
#> 1 dt_nest      3.16ms    2.88MB
#> 2 group_nest   4.79ms    2.54MB
#> # A tibble: 2 x 3
#>   expression   median mem_alloc
#>   <chr>      <bch:tm> <bch:byt>
#> 1 dt_unnest     4.7ms    5.49MB
#> 2 unnest       12.6ms    8.47MB

Pivoting

Finally, thanks to [@mtfairbanks](https://github.com/mtfairbanks), we now have pivoting translations to data.table::melt() and data.table::dcast(). Consider the following example (similar to the example in tidyr::pivot_longer() and tidyr::pivot_wider()):

billboard <- tidyr::billboard

# note the warning - melt is telling us what 
#   it did with the various data types---logical (where there were just NAs
#   and numeric
longer <- billboard %>%
  dt_pivot_longer(
     cols = c(-artist, -track, -date.entered),
     names_to = "week",
     values_to = "rank"
  )
#> Warning in melt.data.table(data = dt_, id.vars = id_vars, measure.vars =
#> cols, : 'measure.vars' [wk1, wk2, wk3, wk4, ...] are not all of the same
#> type. By order of hierarchy, the molten data value column will be of type
#> 'double'. All measure variables not of type 'double' will be coerced too.
#> Check DETAILS in ?melt.data.table for more on coercion.
longer
#>                  artist                   track date.entered week rank
#>     1:            2 Pac Baby Don't Cry (Keep...   2000-02-26  wk1   87
#>     2:          2Ge+her The Hardest Part Of ...   2000-09-02  wk1   91
#>     3:     3 Doors Down              Kryptonite   2000-04-08  wk1   81
#>     4:     3 Doors Down                   Loser   2000-10-21  wk1   76
#>     5:         504 Boyz           Wobble Wobble   2000-04-15  wk1   57
#>    ---                                                                
#> 24088:      Yankee Grey    Another Nine Minutes   2000-04-29 wk76   NA
#> 24089: Yearwood, Trisha         Real Live Woman   2000-04-01 wk76   NA
#> 24090:  Ying Yang Twins Whistle While You Tw...   2000-03-18 wk76   NA
#> 24091:    Zombie Nation           Kernkraft 400   2000-09-02 wk76   NA
#> 24092:  matchbox twenty                    Bent   2000-04-29 wk76   NA

wider <- longer %>% 
  dt_pivot_wider(
    names_from = week,
    values_from = rank
  )
wider[, .(artist, track, wk1, wk2)]
#>                artist                   track wk1 wk2
#>   1:            2 Pac Baby Don't Cry (Keep...  87  82
#>   2:          2Ge+her The Hardest Part Of ...  91  87
#>   3:     3 Doors Down              Kryptonite  81  70
#>   4:     3 Doors Down                   Loser  76  76
#>   5:         504 Boyz           Wobble Wobble  57  34
#>  ---                                                 
#> 313:      Yankee Grey    Another Nine Minutes  86  83
#> 314: Yearwood, Trisha         Real Live Woman  85  83
#> 315:  Ying Yang Twins Whistle While You Tw...  95  94
#> 316:    Zombie Nation           Kernkraft 400  99  99
#> 317:  matchbox twenty                    Bent  60  37

Notably, there are some current limitations to these: 1) tidyselect techniques do not work across the board (e.g. cannot use start_with() and friends) and 2) the functions are new and likely prone to edge-case bugs.

But let’s compare some basic speed and efficiency. Note that the figures are in log-base-10 scale. Because of the data.table functions, these are extremely fast and efficient.

#> # A tibble: 4 x 3
#>   expression        median mem_alloc
#>   <chr>           <bch:tm> <bch:byt>
#> 1 dt_pivot_longer   1.01ms  993.47KB
#> 2 pivot_longer      6.29ms    2.63MB
#> 3 dt_pivot_wider   10.78ms    1.86MB
#> 4 pivot_wider     361.81ms    2.43MB

If Else

Also, the new dt_case_when() function is built on the very fast data.table::fiflese() but has syntax like unto dplyr::case_when(). That is, it looks like:

dt_case_when(condition1 ~ label1,
             condition2 ~ label2,
             ...)

To show that each method, dt_case_when(), dplyr::case_when(), and data.table::fifelse() produce the same result, consider the following example.

x <- rnorm(1e6)

medianx <- median(x)
x_cat <-
  dt_case_when(x < medianx ~ "low",
               x >= medianx ~ "high",
               is.na(x) ~ "unknown")
x_cat_dplyr <-
  case_when(x < medianx ~ "low",
            x >= medianx ~ "high",
            is.na(x) ~ "unknown")
x_cat_fif <-
  fifelse(x < medianx, "low",
  fifelse(x >= medianx, "high",
  fifelse(is.na(x), "unknown", NA_character_)))

identical(x_cat, x_cat_dplyr)
#> [1] TRUE
identical(x_cat, x_cat_fif)
#> [1] TRUE

Notably, dt_case_when() is very fast and memory efficient, given it is built on data.table::fifelse().

#> # A tibble: 3 x 3
#>   expression     median mem_alloc
#>   <chr>        <bch:tm> <bch:byt>
#> 1 case_when     123.8ms   148.8MB
#> 2 dt_case_when   32.5ms    34.3MB
#> 3 fifelse          34ms    34.3MB

Fill

A new function is dt_fill(), which fulfills the role of tidyr::fill() to fill in NA values with values around it (either the value above, below, or trying both). This currently relies on the efficient C++ code from tidyr (fillUp() and fillDown()).

x = 1:10
dt_with_nas <- data.table(
  x = x,
  y = shift(x, 2L),
  z = shift(x, -2L),
  a = sample(c(rep(NA, 10), x), 10),
  id = sample(1:3, 10, replace = TRUE))

# Original
dt_with_nas
#>      x  y  z  a id
#>  1:  1 NA  3 NA  3
#>  2:  2 NA  4  9  3
#>  3:  3  1  5 NA  1
#>  4:  4  2  6  8  3
#>  5:  5  3  7 NA  2
#>  6:  6  4  8 NA  2
#>  7:  7  5  9  7  3
#>  8:  8  6 10 NA  2
#>  9:  9  7 NA NA  2
#> 10: 10  8 NA  4  2

# All defaults
dt_fill(dt_with_nas, y, z, a)
#>      y  z  a
#>  1: NA  3 NA
#>  2: NA  4  9
#>  3:  1  5  9
#>  4:  2  6  8
#>  5:  3  7  8
#>  6:  4  8  8
#>  7:  5  9  7
#>  8:  6 10  7
#>  9:  7 10  7
#> 10:  8 10  4

# by id variable called `grp`
dt_fill(dt_with_nas, 
        y, z, a, 
        id = list(id))
#>     id  y  z  a
#>  1:  3 NA  3 NA
#>  2:  3 NA  4  9
#>  3:  3  2  6  8
#>  4:  3  5  9  7
#>  5:  1  1  5 NA
#>  6:  2  3  7 NA
#>  7:  2  4  8 NA
#>  8:  2  6 10 NA
#>  9:  2  7 10 NA
#> 10:  2  8 10  4

# both down and then up filling by group
dt_fill(dt_with_nas, 
        y, z, a, 
        id = list(id), 
        .direction = "downup")
#>     id y  z  a
#>  1:  3 2  3  9
#>  2:  3 2  4  9
#>  3:  3 2  6  8
#>  4:  3 5  9  7
#>  5:  1 1  5 NA
#>  6:  2 3  7  4
#>  7:  2 4  8  4
#>  8:  2 6 10  4
#>  9:  2 7 10  4
#> 10:  2 8 10  4

In its current form, dt_fill() is faster than tidyr::fill() and uses slightly less memory. Below are the results of filling in the NAs within each id on a 19 MB data set.

x = 1:1e6
dt3 <- data.table(
  x = x,
  y = shift(x, 10L),
  z = shift(x, -10L),
  a = sample(c(rep(NA, 10), x), 10),
  id = sample(1:3, 10, replace = TRUE))
df3 <- data.frame(dt3)

marks3 <-
  bench::mark(
    tidyr::fill(dplyr::group_by(df3, id), x, y),
    tidyfast::dt_fill(dt3, x, y, id = list(id)),
    check = FALSE,
    iterations = 50
  )

#> # A tibble: 2 x 3
#>   expression                                    median mem_alloc
#>   <bch:expr>                                  <bch:tm> <bch:byt>
#> 1 tidyr::fill(dplyr::group_by(df3, id), x, y)   62.5ms    30.7MB
#> 2 tidyfast::dt_fill(dt3, x, y, id = list(id))   22.8ms    29.1MB

Separate

The dt_separate() function is still under heavy development. Its behavior is similar to tidyr::separate() but is lacking some functionality currently. For example, into needs to be supplied the maximum number of possible columns to separate.

dt_separate(data.table(col = "A.B.C"), col, into = c("A", "B"))
#> Error in `[.data.table`(dt, , eval(split_it)) : 
#>   Supplied 2 columns to be assigned 3 items. Please see NEWS for v1.12.2.

For current functionality, consider the following example.

dt_to_split <- data.table(
  x = paste(letters, LETTERS, sep = ".")
)

dt_separate(dt_to_split, x, into = c("lower", "upper"))
#>    lower upper
#> 1:     a     A
#> 2:     b     B
#> 3:     c     C
#> 4:     d     D
#> 5:     e     E
#> 6:     f     F

Testing with a 4 MB data set with one variable that has columns of “A.B” repeatedly, shows that dt_separate() is fast but less memory efficient than tidyr::separate().

#> # A tibble: 3 x 3
#>   expression            median mem_alloc
#>   <chr>               <bch:tm> <bch:byt>
#> 1 separate               358ms    11.6MB
#> 2 dt_separate            113ms    30.6MB
#> 3 dt_separate-mutable    108ms    26.7MB

Count and Uncount

The dt_count() function does essentially what dplyr::count() does. Notably, this, unlike the majority of other dt_ functions, wraps a very simple statement in data.table. That is, data.table makes getting counts very simple and concise. Nonetheless, dt_count() fits the general API of tidyfast. To some degree, dt_uncount() is also a fairly simple wrapper, although the approach may not be as straightforward as that for dt_count().

The following examples show how count and uncount can work. We’ll use the dt data table from the nesting examples.

counted <- dt_count(dt, grp)
counted
#>    grp     N
#> 1:   1 19638
#> 2:   2 19987
#> 3:   3 20033
#> 4:   4 20269
#> 5:   5 20073
uncounted <- dt_uncount(counted, N)
print(uncounted)
#>         grp
#>      1:   1
#>      2:   1
#>      3:   1
#>      4:   1
#>      5:   1
#>     ---    
#>  99996:   5
#>  99997:   5
#>  99998:   5
#>  99999:   5
#> 100000:   5

These are also quick (not that the tidyverse functions were at all slow here).

dt5 <- copy(dt)
df5 <- data.frame(dt5)

marks5 <-
  bench::mark(
    counted_tbl <- dplyr::count(df5, grp),
    counted_dt <- tidyfast::dt_count(dt5, grp),
    tidyr::uncount(counted_tbl, n),
    tidyfast::dt_uncount(counted_dt, N),
    check = FALSE,
    iterations = 25
  )

Notes

Please note that the tidyfast project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

We want to thank our wonderful contributors:

  • mtfairbanks for PR #6 providing initial the pivoting functions. Note the gdt package that compliments some of tidyfasts functionality.

Complementary Packages: