reconverse/incidence2

version2 - dplyr for vctrs_rcrd and POSIXlt support

Closed this issue · 5 comments

Currently using only data.table for aggregation which means we cannot support non-atomic columns (see https://github.com/Rdatatable/data.table/issues?q=is%3Aopen+is%3Aissue+label%3A%22non-atomic+column%22 for further discussion). Previously I switched behaviour based on the input and dispatched to either data.table or dplyr accordingly. Need to think about whether I do the same thing going forward or error on known problematic inputs.

I want to avoid changing user inputs so if I do not go with the dual use I'd like to error on POSIXlt input .

Maybe support POSIXlt with a warning but error on vctrs_rcrd input???

Would it make sense to automatically convert POSIXlt to POSIXct with a warning?

Am planning to add dplyr support for this but at the same time I'm wondering about disallowing POSIXt objects.

My issue with POSIXt is that by looking at it you cannot tell what units someone is thinking in (e.g. datetime hh:00:00 could be someone thinking in seconds, minutes or hours). Without extra information passed around it's hard to tell. This makes me think that we whould push people to use alternatives. For instance if someone wanted hourly recording then either seperate day and hour groups may be appropriate or perhaps the use of something like {clock}. Code for reference to hopefully show what I'm meaning

(dat <- as.POSIXct("2020-01-01"))
#> [1] "2020-01-01 GMT"
dat + 1
#> [1] "2020-01-01 00:00:01 GMT"

(clock_dat <- 
    dat |> 
    clock::as_naive_time() |> 
    clock::time_point_cast("hour"))
#> <clock_naive_time[1]>
#> [1] "2020-01-01T00"

as.POSIXct(clock_dat + 1)
#> [1] "2020-01-01 01:00:00 GMT"

@thibautjombart - Feedback wanted as want to think carefully about this.

I would add an argument to relevant functions (I am thinking, maybe only the constructor, actually) allowing the user to specify the unit, and defaulting to 'day'. In practice, I have seen maybe once or twice in the last 10 years a need for handling sub-day units, and most people would want days. Also, I think it may still be the case that rio::import("a_linelist.xlsx") will store dates as POSIXct by default, where the hhmmss are meaningless and not part of the original data. Maybe something along the lines of:

## most common use-case
source_file |>
  rio::import() |> 
  incidence(date_index = "date_onset") # POSIXct where units = days

## less common use-case
source_file |>
  rio::import() |> 
  incidence(dt_admission = "date__time_admission", posix_unit = "hour") # POSIX with units = hours

I am even thinking of converting POSIXct with days as units to Date, which could be controlled using an argument simplify_posix = TRUE by default.

Currently not going to support.