timescale/timescaledb-toolkit

interpolated integral/average with timezone offset fails for 1 month interval

phoenix147 opened this issue · 5 comments

Relevant system information:

  • OS: timescaledb-ha:pg14.5-ts2.8.0-latest
  • PostgreSQL version: 14.5
  • TimescaleDB Toolkit version: 1.11.1
  • Installation method: docker, kubernetes

Describe the bug

We're trying to run an interpolated_integral (or interpolated_average) query with an underlying subquery, which has a timezone offset:

SELECT bucket, toolkit_experimental.interpolated_integral(agg, bucket, '1month', LAG(agg) OVER (ORDER BY bucket), LEAD(agg) OVER (ORDER BY bucket), 'hours')
FROM (
    SELECT time_bucket('1month', time, 'CET') AS bucket, time_weight('LOCF', time, value) as agg
    FROM measurements WHERE variable = 1 
    AND time >= '2022-12-01 00:00:00' 
    AND time <= '2023-02-21 00:00:00' GROUP BY bucket)
temp ORDER BY bucket;

This works for an interval up to 1 day, but using 1 month, the following error occurs:

ERROR: assertion failed: interval_start <= self.first.ts &&\n interval_start + interval_len > self.last.ts CONTEXT: extension/src/time_weighted_average.rs:54:9 SQL state: XX000

The subquery results:

"2023-01-31 23:00:00+00"	"(version:1,first:(ts:""2023-02-01 07:03:17.904652+00"",val:0.07),last:(ts:""2023-02-21 10:55:07.632907+00"",val:103.23),weighted_sum:16610132665429.502,method:LOCF)"
"2022-12-31 23:00:00+00"	"(version:1,first:(ts:""2023-01-01 06:56:08.400243+00"",val:0.02),last:(ts:""2023-01-31 15:54:08.149332+00"",val:0),weighted_sum:10386759309142.107,method:LOCF)"
"2022-11-30 23:00:00+00"	"(version:1,first:(ts:""2022-12-01 06:47:15.756845+00"",val:0.1),last:(ts:""2022-12-31 15:19:45.953033+00"",val:0),weighted_sum:7417551236871.949,method:LOCF)"

Using this timestamps and a 1 month interval only adds 30 days, which results in the error above

Removing the timezone from the query, results in the following set:

"2022-12-01 00:00:00+00"	"(version:1,first:(ts:""2022-12-01 06:47:15.756845+00"",val:0.1),last:(ts:""2022-12-31 15:19:45.953033+00"",val:0),weighted_sum:7417551236871.949,method:LOCF)"
"2023-01-01 00:00:00+00"	"(version:1,first:(ts:""2023-01-01 06:56:08.400243+00"",val:0.02),last:(ts:""2023-01-31 15:54:08.149332+00"",val:0),weighted_sum:10386759309142.107,method:LOCF)"
"2023-02-01 00:00:00+00"	"(version:1,first:(ts:""2023-02-01 07:03:17.904652+00"",val:0.07),last:(ts:""2023-02-21 11:05:53.185773+00"",val:88.63),weighted_sum:16651660179395.787,method:LOCF)"

which works correctly.

Is there any way to work around this issue?

To Reproduce
Steps to reproduce the behavior:
Use a query above with time_bucket and timezone over several months (including months with 31 days)

Expected behavior
The query should run even when using time_bucket with timezone data

Actual behavior
The query fails with above error.

Oy. Time zones, and especially their interactions with months are truly the worst.

So, I did some digging, and I think what's going on here is that time_bucket('1 month', time, time zone) returns a timestamptz of the start of the month at that time zone, which is correct, but if I add one month to that time stamp, now not knowing I'm working in the specified time zone, the logic doesn't work exactly the same way, so you end up with some weirdness - so what I'd do to fix that is create a quick function:

CREATE OR REPLACE FUNCTION calc_month_interval(day timestamptz, tz text) RETURNS INTERVAL as $$
SELECT ($1 at time zone $2) + interval '1 month' - ($1 at time zone $2) $$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

That does the conversion to the correct time zone and then call it in your SQL statement:

SELECT bucket, calc_month_interval(bucket, 'CET'), toolkit_experimental.interpolated_integral(agg, bucket, calc_month_interval(bucket,'CET'), LAG(agg) OVER (ORDER BY bucket), LEAD(agg) OVER (ORDER BY bucket), 'hours')
FROM (
    SELECT time_bucket('1month', time, 'CET') AS bucket, time_weight('LOCF', time, value) as agg
    FROM measurements WHERE variable = 1 
    AND time >= '2022-12-01 00:00:00' 
    AND time <= '2023-02-21 00:00:00' GROUP BY bucket)
temp ORDER BY bucket;

This is a bit unfortunate and we may be able to create a function that helps with this...but it's a bit of a weird edge cases with the interactions of time zones, buckets, and months and oof. Good catch. Let us know if this helps!

David's solution here is much more robust, so I'm just going to delete my last comment.

This appears to be somewhat less robust and only work if your server time zone is set to 'UTC', I'm not 100% sure why yet, but we'll keep looking into this, it may be a bug around how we do time buckets with time zone offsets related in some way to: timescale/timescaledb#5334

syvb commented

The problem appears to be that we are using timestamptz_pl_interval to compute the end time from the given start and interval parameters, and timestamptz_pl_interval always uses the session timezone to add the months and days of the interval to the timestamptz. AFAICT there isn't any direct alternative to timestamptz_pl_interval that allows specifying the timezone.

Since session configuration is transactional, another workaround is set the session timezone in a transaction, make your query, then rollback.
edit: or use SET LOCAL within a transaction, then you don't even need to rollback it

@davidkohn88

Thanks for the workaround, it's working good.

We also experimented with using the simple interval/average functions like this

SELECT time_bucket('1 month', time) as dt, toolkit_experimental.integral(time_weight('LOCF', time, value), 'hours') AS tw
FROM measurements
WHERE variable = 1
AND time >= '2023-01-01'
AND time <= '2023-02-21'
GROUP BY dt
ORDER BY dt

but the results, especially with fine grained buckets (1m, 15m) were not as good as with the interpolation variants