parsing year-quarter formats
TimTaylor opened this issue · 3 comments
lubridate's custom parser lets us parse quarters with the %q format. Is this functionality something that might be added to clock? Currently I pre-process in base R and then hand the result to clock, but the string-manipulation overhead becomes noticeable for larger vectors compared with the C-level parser used by lubridate. Example of the functionality below:
options(lubridate.verbose = TRUE)
dat <- "2021q2"
lubridate::yq(dat)
#> 1 parsed with %Yq%q
#> [1] "2021-04-01"
lubridate::fast_strptime(dat, "%Yq%q")
#> [1] "2021-04-01 UTC"
Created on 2022-08-11 by the reprex package (v2.0.1)
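For context, the base R pre-processing described above might look roughly like the following. This is only a sketch, assuming every input has the form <year>q<quarter>; the variable names are illustrative, and the clock calls are the ones used elsewhere in this thread.

```r
library(clock)

dat <- c("2021q2", "2021q3")

# Split each string on the literal "q" using base R only
parts <- strsplit(dat, "q", fixed = TRUE)

# Pull out the two pieces and coerce them to integer
year <- as.integer(vapply(parts, function(p) p[1L], character(1)))
quarter <- as.integer(vapply(parts, function(p) p[2L], character(1)))

# Hand the integers to clock: first day of each quarter, then convert to Date
dates <- as_date(year_quarter_day(year, quarter, 1L))
dates
#> [1] "2021-04-01" "2021-07-01"
```

The list-based strsplit() step is the part that tends to dominate on long vectors, which is the overhead the issue refers to.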
Yeah, I'd like to add clock::year_quarter_day_parse() (like year_month_day_parse()), which would let you handle this. You could then convert the result to Date or POSIXct with as_date() or as_date_time().
I imagine this is probably the fastest way in the meantime:
library(clock)
library(stringr)
dat <- c("2021q2", "2021q3")
dat <- str_split_fixed(dat, "q", 2)
dat
#>      [,1]   [,2]
#> [1,] "2021" "2"
#> [2,] "2021" "3"
year <- as.integer(dat[, 1, drop = TRUE])
quarter <- as.integer(dat[, 2, drop = TRUE])
yq <- year_quarter_day(year, quarter)
yq
#> <year_quarter_day<January><quarter>[2]>
#> [1] "2021-Q2" "2021-Q3"
# Then if you need Date
as_date(set_day(yq, 1))
#> [1] "2021-04-01" "2021-07-01"
That method took 1.5 seconds with 2 million strings.
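The quarter-to-date step can be checked by hand. With a January year start (the <January> shown in the printed type above), quarter q begins in month 3 * (q - 1) + 1. A small base R sketch of that arithmetic follows; quarter_start_month() is a hypothetical helper written for illustration, not part of clock or lubridate.

```r
# Hypothetical helper: first month of each quarter, assuming the year
# starts in January and quarters are numbered 1 to 4.
quarter_start_month <- function(quarter) {
  stopifnot(all(quarter %in% 1:4))
  3L * (as.integer(quarter) - 1L) + 1L
}

quarter_start_month(1:4)
#> [1]  1  4  7 10

# Rebuild the two dates from the example using base R only
first_day <- as.Date(sprintf("%04d-%02d-01", 2021L, quarter_start_month(2:3)))
first_day
#> [1] "2021-04-01" "2021-07-01"
```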
Cool, year_quarter_day_parse() would be great to have. For completeness, and for comparison with yq(), here's the closest I think we can currently get without the additional parser (the main difference from the above being stringi rather than stringr):
library(stringi)
library(lubridate, include.only = "yq")
library(clock)
library(microbenchmark)
n <- 2000000L
yrs <- rep_len(1022:2022, n)
qtrs <- rep_len(1:4, n)
input <- sprintf("%dq%d", yrs, qtrs)
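clocky <- function(x) {
  # Split "2021q2" into a two-column character matrix: year, quarter
  x <- stri_split_fixed(x, "q", n = 2L, simplify = TRUE)
  # Coerce the whole matrix to integer in place
  storage.mode(x) <- "integer"
  # First day of each quarter, then convert to Date
  x <- year_quarter_day(x[, 1L], x[, 2L], 1L)
  as_date(x)
}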
microbenchmark(yq(input), clocky(input), check = "identical")
#> Unit: milliseconds
#>           expr      min       lq     mean   median       uq      max neval
#>      yq(input) 162.3977 183.7156 208.0677 205.3677 224.6783 319.2763   100
#>  clocky(input) 701.2078 760.9052 849.4119 870.0091 935.9795 997.0503   100