/populate

Safe Wrappers around 'dplyr::mutate()'

Primary LanguageR

populate

Draft for discussion.

This issue was closed : tidyverse/dplyr#6547

Is this extension package useful ?

{populate} provides stricter wrappers around dplyr::mutate():

  • populate() makes sure the input’s ptype is preserved
    • no column type change
    • no column addition
    • no column deletion (mutate() can delete with NULL)
    • cast by default, assert optionally
  • collate() only creates new columns

Do we need more or should we keep it simple?

  • abrogate() to remove columns ?
  • sublimate() like populate() but allow ptype change (doesn’t allow new col creation) ?
  • summarize() variants ?

Is it better to use a subclass of data frame (as in linked github issue) so we can make safe versions of everything including functions like count(), unnest(), joins …

We’d lose autocomplete of new args, maybe more confusing than having new verbs ?

Installation

You can install the development version of populate like so:

devtools::install_github("cynkra/populate")

Examples

library(populate)

data <- tibble::tibble(
  a = letters[1:2],
  b = c(1,2), 
  c = factor(letters[1:2]), 
  d = as.Date(c("2022-01-01", "2022-01-02")), 
  e = vctrs::list_of(cars)
)

# can't create a column if it exists
collate(data, a = 1) 
#> Error in `collate()`:
#> ! `collate()` can't override existing columns.
#> ℹ Found in data: `a`.
#> ℹ Do you need `populate()`?

# but we can create new columns
collate(data, ee = 1) 
#> # A tibble: 2 × 6
#>   a         b c     d                       e    ee
#>   <chr> <dbl> <fct> <date>     <list<df[,2]>> <dbl>
#> 1 a         1 a     2022-01-01       [50 × 2]     1
#> 2 b         2 b     2022-01-02       [50 × 2]     1

# we can't create a new column with populate()
populate(data, ee = 1) 
#> Error in `populate()`:
#> ! `populate()` can't create new columns.
#> ✖ Not found in data: `ee`.
#> ℹ Did you mispell `e`?
#> ℹ Do you need `collate()`?

# can't cast character to double
populate(data, a = 1)
#> Error in `populate()`:
#> ! Can't `populate()` the data.
#> Caused by error in `dplyr::mutate()`:
#> ! Problem while computing `a = vctrs::vec_cast(1, vctrs::vec_ptype(a))`.
#> Caused by error:
#> ! Can't convert `1` <double> to <character>.

# casting integer to double
populate(data, b = 3:4) 
#> # A tibble: 2 × 5
#>   a         b c     d                       e
#>   <chr> <dbl> <fct> <date>     <list<df[,2]>>
#> 1 a         3 a     2022-01-01       [50 × 2]
#> 2 b         4 b     2022-01-02       [50 × 2]

# doesn't work if `.strict` is `TRUE`
populate(data, b = 3:4, .strict = TRUE) 
#> Error in `populate()`:
#> ! Can't `populate()` the data.
#> Caused by error in `dplyr::mutate()`:
#> ! Problem while computing `b = vctrs::vec_assert(3:4,
#>   vctrs::vec_ptype(b))`.
#> Caused by error:
#> ! `3:4` must be a vector with type <double>.
#> Instead, it has type <integer>.

# casting character to factor with allowed levels
populate(data, c = c("b", "b")) 
#> # A tibble: 2 × 5
#>   a         b c     d                       e
#>   <chr> <dbl> <fct> <date>     <list<df[,2]>>
#> 1 a         1 b     2022-01-01       [50 × 2]
#> 2 b         2 b     2022-01-02       [50 × 2]

# can't cast because wrong levels
populate(data, c = c("b", "d")) 
#> Error in `populate()`:
#> ! Can't `populate()` the data.
#> Caused by error in `dplyr::mutate()`:
#> ! Problem while computing `c = vctrs::vec_cast(c("b", "d"),
#>   vctrs::vec_ptype(c))`.
#> Caused by error:
#> ! Can't convert from `c("b", "d")` <character> to <factor<38051>> due to loss of generality.
#> • Locations: 2

# datetimes are casted to date
populate(data, d = lubridate::as_datetime(c("2022-01-01", "2022-01-02"))) 
#> # A tibble: 2 × 5
#>   a         b c     d                       e
#>   <chr> <dbl> <fct> <date>     <list<df[,2]>>
#> 1 a         1 a     2022-01-01       [50 × 2]
#> 2 b         2 b     2022-01-02       [50 × 2]

 # characters can't be casted to date
populate(data, d = c("2022-01-01", "2022-01-02"))
#> Error in `populate()`:
#> ! Can't `populate()` the data.
#> Caused by error in `dplyr::mutate()`:
#> ! Problem while computing `d = vctrs::vec_cast(c("2022-01-01",
#>   "2022-01-02"), vctrs::vec_ptype(d))`.
#> Caused by error:
#> ! Can't convert `c("2022-01-01", "2022-01-02")` <character> to <date>.

# using list_of allowed us to prevent corrupting our data silently
populate(data, e = list(iris)) 
#> Error in `populate()`:
#> ! Can't `populate()` the data.
#> Caused by error in `dplyr::mutate()`:
#> ! Problem while computing `e = vctrs::vec_cast(list(iris),
#>   vctrs::vec_ptype(e))`.
#> Caused by error:
#> ! Can't convert `list(iris)` <list> to <list_of<
#>   data.frame<
#>     speed: double
#>     dist : double
#>   >
#> >>.

# and we don't have to bother with list_of anymore if we feed the right format
populate(data, e = list(head(cars))) 
#> Error in `populate()`:
#> ! Can't `populate()` the data.
#> Caused by error in `dplyr::mutate()`:
#> ! Problem while computing `e = vctrs::vec_cast(list(head(cars)),
#>   vctrs::vec_ptype(e))`.
#> Caused by error:
#> ! Can't convert `list(head(cars))` <list> to <list_of<
#>   data.frame<
#>     speed: double
#>     dist : double
#>   >
#> >>.