reconverse/incidence2

Should `complete_dates()` use a different default `by` argument for POSIXct dates?

Closed this issue · 7 comments

Reprex (warning: this is likely to make your R session, and possibly other open programs, crash)

x <- data.frame(
    dates = Sys.Date() + c(1:3000),
    groups = c("grp1","grp2", "grp1"),
    counts = 1:3000
)
x$dates <- as.POSIXct(x$dates)

i <- incidence(x, date_index = "dates", groups = "groups", counts = "counts")
complete_dates(i)

The problem is here:

if (expand)
dates <- seq(min(dates), max(dates), by = by)

by = 1 for POSIXct creates one value per second. I understand this is not an actual bug but this should be expected behaviour. However, this is probably not what is intended in most cases. I would suggest by = "1 day" as a safer default.

I expect this to be somewhat common as it looks like readxl (and by extension rio) convert all date columns to dttm/POSIXct.

I need to think about this carefully. Quickly documenting some thoughts below and would appreciate your thoughts @Bisaloo:

POSIXct are a very bad input type to use if a user only wants to monitor daily occurrences. Unless they are interested in per second occurrences they can easily mess up anything involving aggregation. If day precision is desired, either the input should first be converted to a date object or the interval specification can be used.

I considered not supporting POSIXct at all but decided to allow them in the end as they are commonly found. I'm not keen to change this now but I wonder if a warning upon <incidence2> object creation, and a link to some documentation, explaining why the user may prefer a different input type would be worthwhile.

Another thing I could consider is an error if complete_dates() is called with POSIXct dates and by = 1 unless an argument if set to TRUE (e.g. something like force_seconds = FALSE for the default).

If precision less than a day is something you think users need/want, we would likely have to reintroduce support for {clock} (and also bring in {dplyr} to do the aggregations) as that is one of the packages I know that can do sub day precision really well.

Thoughts appreciated.

Also relevant are related thoughts from @thibautjombart at the tail end of #88

I do agree though, most users will be expecting to be working 'by day'. Need to balance encouraging people to use dates where appropriate versus what most people would expect as reasonable defaults. 🤔

POSIXct are a very bad input type to use if a user only wants to monitor daily occurrences.

Yes, fully agreed but it looks like they're easily produced by rio. I considered opening an issue there or in readxl since the original excel file contains a date and not a datetime but I'm not completely sure it's something they can easily change either.

I considered not supporting POSIXct at all but decided to allow them in the end as they are commonly found. I'm not keen to change this now but I wonder if a warning upon object creation, and a link to some documentation, explaining why the user may prefer a different input type would be worthwhile.

Yes, if you consider they are likely to cause issues when using incidence2, I think either an error or a warning would be very helpful to communicate it to users. I believe it's quite easy to miss the fact that you're using POSIXt instead of a Date if you just trust your import package and don't carefully check the output.

Another thing I could consider is an error if complete_dates() is called with POSIXct dates and by = 1 unless an argument if set to TRUE (e.g. something like force_seconds = FALSE for the default).

I don't think it's worth the hassle in terms of added complexity for users and increased maintenance load.

I do agree though, most users will be expecting to be working 'by day'.

Yes, my point with this suggestion is that, as far as I can tell, changing by = 1 to by = "1 day" won't have any impact on users with Dates or grates objects, but it could really save users with POSIXct from shooting themselves in the foot.

Yes, my point with this suggestion is that, as far as I can tell, changing by = 1 to by = "1 day" won't have any impact on users with Dates or grates objects, but it could really save users with POSIXct from shooting themselves in the foot.

Unfortunately, the way the code is currently written, the by argument shouldn't really be exposed to users at all as it only really makes sense to be 1. Roughly speaking the function calculates a data frame spanning seq(from = min(date_index), to = maxdate_index), by =1) and joins the original data on to this.

For by values not equal to 1, I should really check that all of the input dates are covered by the generated sequence and error if not. Will raise this as a bug before changing any defaults.

@Bisaloo - What do you think of the following changes (experimenting in the branch https://github.com/reconverse/incidence2/tree/POSIXct)

library(incidence2)
#> Loading required package: grates

x <- data.frame(
    dates = Sys.Date() + c(1:3),
    groups = c("grp1","grp2", "grp1"),
    counts = 1:3
)
x$dates <- as.POSIXct(x$dates)

Creating an incidence object like this will now give a warning

i <- incidence(x, date_index = "dates", groups = "groups", counts = "counts")
#> Warning in incidence(): <POSIXct> date_index columns detected. Internally
#> <POSIXct> objects are represented as seconds since the UNIX epoch and, in our
#> experience, this level of granularity is not normally desired for aggregation.
#> For daily incidence consider converting the inputs to <Dates>. This can be done
#> prior to calling `incidence()` or, alternatively, by setting the argument
#> `interval = 'day'` within the call itself.

Calling complete_dates will now error, but with an option for an explicit opt out (allow_POSIXct)

complete_dates(i)
#> Error in complete_dates(): <POSIXct> date_index columns detected.
#> Internally <POSIXct> objects are represented as seconds since the UNIX epoch
#> and calling `complete_dates()` on an object of granularity can lead to significant
#> memory usage. If you are sure you wish to do this, please call again with the
#> argument `allow_POSIXct` set to `TRUE`. If this level of aggregation is not desired,
#> consider recreating the <incidence> object using <Dates> for daily incidence.
#> This can be done prior to calling `incidence()` or, alternatively, by setting the
#> argument `interval = 'day'` within the call itself.

I've added now added a day interval to ease conversion if desired

(i2 <- incidence(x, date_index = "dates", groups = "groups", counts = "counts", interval = "day"))
#> # incidence:  3 x 4
#> # count vars: counts
#> # groups:     groups
#>   date_index groups count_variable count
#> * <date>     <chr>  <fct>          <int>
#> 1 2023-05-31 grp1   counts             1
#> 2 2023-06-01 grp2   counts             2
#> 3 2023-06-02 grp1   counts             3
complete_dates(i2)
#> # incidence:  6 x 4
#> # count vars: counts
#> # groups:     groups
#>   date_index groups count_variable count
#>   <date>     <chr>  <fct>          <int>
#> 1 2023-05-31 grp1   counts             1
#> 2 2023-05-31 grp2   counts             0
#> 3 2023-06-01 grp1   counts             0
#> 4 2023-06-01 grp2   counts             2
#> 5 2023-06-02 grp1   counts             3
#> 6 2023-06-02 grp2   counts             0

Created on 2023-05-30 with reprex v2.0.2

closed by dc8a315