interpolated integral/average with timezone offset fails for 1 month interval
phoenix147 opened this issue · 5 comments
Relevant system information:
- OS: timescaledb-ha:pg14.5-ts2.8.0-latest
- PostgreSQL version: 14.5
- TimescaleDB Toolkit version: 1.11.1
- Installation method: docker, kubernetes
Describe the bug
We're trying to run an interpolated_integral (or interpolated_average) query with an underlying subquery, which has a timezone offset:
SELECT bucket, toolkit_experimental.interpolated_integral(agg, bucket, '1month', LAG(agg) OVER (ORDER BY bucket), LEAD(agg) OVER (ORDER BY bucket), 'hours')
FROM (
SELECT time_bucket('1month', time, 'CET') AS bucket, time_weight('LOCF', time, value) as agg
FROM measurements WHERE variable = 1
AND time >= '2022-12-01 00:00:00'
AND time <= '2023-02-21 00:00:00' GROUP BY bucket)
temp ORDER BY bucket;
This works for an interval up to 1 day, but using 1 month, the following error occurs:
ERROR: assertion failed: interval_start <= self.first.ts &&\n interval_start + interval_len > self.last.ts CONTEXT: extension/src/time_weighted_average.rs:54:9 SQL state: XX000
The subquery results:
"2023-01-31 23:00:00+00" "(version:1,first:(ts:""2023-02-01 07:03:17.904652+00"",val:0.07),last:(ts:""2023-02-21 10:55:07.632907+00"",val:103.23),weighted_sum:16610132665429.502,method:LOCF)"
"2022-12-31 23:00:00+00" "(version:1,first:(ts:""2023-01-01 06:56:08.400243+00"",val:0.02),last:(ts:""2023-01-31 15:54:08.149332+00"",val:0),weighted_sum:10386759309142.107,method:LOCF)"
"2022-11-30 23:00:00+00" "(version:1,first:(ts:""2022-12-01 06:47:15.756845+00"",val:0.1),last:(ts:""2022-12-31 15:19:45.953033+00"",val:0),weighted_sum:7417551236871.949,method:LOCF)"
Using this timestamps and a 1 month interval only adds 30 days, which results in the error above
Removing the timezone from the query, results in the following set:
"2022-12-01 00:00:00+00" "(version:1,first:(ts:""2022-12-01 06:47:15.756845+00"",val:0.1),last:(ts:""2022-12-31 15:19:45.953033+00"",val:0),weighted_sum:7417551236871.949,method:LOCF)"
"2023-01-01 00:00:00+00" "(version:1,first:(ts:""2023-01-01 06:56:08.400243+00"",val:0.02),last:(ts:""2023-01-31 15:54:08.149332+00"",val:0),weighted_sum:10386759309142.107,method:LOCF)"
"2023-02-01 00:00:00+00" "(version:1,first:(ts:""2023-02-01 07:03:17.904652+00"",val:0.07),last:(ts:""2023-02-21 11:05:53.185773+00"",val:88.63),weighted_sum:16651660179395.787,method:LOCF)"
which works correctly.
Is there any way to work around this issue?
To Reproduce
Steps to reproduce the behavior:
Use a query above with time_bucket and timezone over several months (including months with 31 days)
Expected behavior
The query should run even when using time_bucket with timezone data
Actual behavior
The query fails with above error.
Oy. Time zones, and especially their interactions with months are truly the worst.
So, I did some digging, and I think what's going on here is that time_bucket('1 month', time, time zone)
returns a timestamptz of the start of the month at that time zone, which is correct, but if I add one month to that time stamp, now not knowing I'm working in the specified time zone, the logic doesn't work exactly the same way, so you end up with some weirdness - so what I'd do to fix that is create a quick function:
CREATE OR REPLACE FUNCTION calc_month_interval(day timestamptz, tz text) RETURNS INTERVAL as $$
SELECT ($1 at time zone $2) + interval '1 month' - ($1 at time zone $2) $$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
That does the conversion to the correct time zone and then call it in your SQL statement:
SELECT bucket, calc_month_interval(bucket, 'CET'), toolkit_experimental.interpolated_integral(agg, bucket, calc_month_interval(bucket,'CET'), LAG(agg) OVER (ORDER BY bucket), LEAD(agg) OVER (ORDER BY bucket), 'hours')
FROM (
SELECT time_bucket('1month', time, 'CET') AS bucket, time_weight('LOCF', time, value) as agg
FROM measurements WHERE variable = 1
AND time >= '2022-12-01 00:00:00'
AND time <= '2023-02-21 00:00:00' GROUP BY bucket)
temp ORDER BY bucket;
This is a bit unfortunate and we may be able to create a function that helps with this...but it's a bit of a weird edge cases with the interactions of time zones, buckets, and months and oof. Good catch. Let us know if this helps!
David's solution here is much more robust, so I'm just going to delete my last comment.
This appears to be somewhat less robust and only work if your server time zone is set to 'UTC', I'm not 100% sure why yet, but we'll keep looking into this, it may be a bug around how we do time buckets with time zone offsets related in some way to: timescale/timescaledb#5334
The problem appears to be that we are using timestamptz_pl_interval
to compute the end time from the given start
and interval
parameters, and timestamptz_pl_interval
always uses the session timezone to add the months and days of the interval to the timestamptz. AFAICT there isn't any direct alternative to timestamptz_pl_interval
that allows specifying the timezone.
Since session configuration is transactional, another workaround is set the session timezone in a transaction, make your query, then rollback.
edit: or use SET LOCAL
within a transaction, then you don't even need to rollback it
Thanks for the workaround, it's working good.
We also experimented with using the simple interval/average functions like this
SELECT time_bucket('1 month', time) as dt, toolkit_experimental.integral(time_weight('LOCF', time, value), 'hours') AS tw
FROM measurements
WHERE variable = 1
AND time >= '2023-01-01'
AND time <= '2023-02-21'
GROUP BY dt
ORDER BY dt
but the results, especially with fine grained buckets (1m, 15m) were not as good as with the interpolation variants