thicken slow because of character conversion?
Closed this issue · 6 comments
Hey @EdwinTh, I'm a big fan of the package. I had a quick question. It looks like thicken()
might be taking longer than it needs to for objects that inherit from POSIXt
. After doing some profiling, I noticed that the function posix_to_date()
in round_thicken()
attempts to convert a POSIXt
object to a Date
object (if it can) after the thickening. This is great, but it does so by first converting to character, then to date, see this line.
Is the conversion to character necessary? I removed it and ran some speed tests. See below for the potential gains without that one conversion. A ~10x speed gain isn't bad for the 100k row dataset.
With the conversion
library(microbenchmark)
library(padr)
suppressPackageStartupMessages(library(dplyr))
microbenchmark::microbenchmark(
thicken(emergency, "day"),
times = 5
)
#> Unit: milliseconds
#> expr min lq mean median uq
#> thicken(emergency, "day") 988.6313 1037.317 1078.123 1066.518 1081.611
#> max neval
#> 1216.539 5
Without the as.character()
conversion.
library(microbenchmark)
library(padr)
suppressPackageStartupMessages(library(dplyr))
microbenchmark::microbenchmark(
thicken(emergency, "day"),
times = 5
)
#> Unit: milliseconds
#> expr min lq mean median uq
#> thicken(emergency, "day") 131.0412 133.329 189.0237 133.4806 148.5445
#> max neval
#> 398.7233 5
Thanks @DavisVaughan, I worked hard on improving thicken’s performance for v0.4, but this I definitely overlooked. Its still in time for the release, will adjust it as soon as I find the time.
I retrieved why I added this (should have made a comment). Strangely enough, as.Date
will set the output of thickening to the wrong date. (See the first unit tests of test_thicken_integration
). This might be due to timezones, I still need to study this. Apparently, I did a quick and dirty workaround, but this thus gives performance loss. We should dig around more to find a better performing solution.
Which tests in particular? I'm pretty sure I just ran all the test_thicken_integration
tests with no issues after removing the as.character()
. I wouldn't doubt that some work around might be needed, but I'm not seeing it so far.
Thats what I thought, you have US locale, I have CET. This is so tricky to unit test properly. My guess is that it takes a central time (like UTC) and determines the day relative to this time zone for your own time zone. I will expand the unit tests soon, so it includes multiple time zones. We can than find a better performing solution.
as.Date()
has an S3 method specifically for POSIXct
conversion. It supports a tz
argument.
x <- as.Date(x, tz = tz(x))
This works! I changed my Sys.setenv(TZ = "CET")
so that it would match yours. I confirmed the error with just using as.Date(x)
. With the above change, I think it works as expected.
Time zones are the bane of my existence.
Beautiful, thanks for the search.
And yes time zones should be abandoned!