timescale/timescaledb

[Bug]: time_bucket_gapfill only handles time zone correctly if it is equal to the DB time zone

maxhertrampf opened this issue · 7 comments

What type of bug is this?

Incorrect result

What subsystems and features are affected?

Gapfill

What happened?

Since TimescaleDB version 2.9.0, time_bucket_gapfill can handle time zones with the additional timezone parameter.

However, when the database's time zone is not equal to the timezone parameter, the function produces wrong results. The timestamps, the filled gaps, and even the number of results are incorrect (see below).

This happens when aggregating over a longer period (e.g. two years) in monthly intervals.
It does not occur in some other cases with fewer values and a shorter interval.

A workaround is to set the time zone for the transaction to the same one as the one used for the gapfilling, e.g. via set local timezone = 'Europe/Berlin', but this should not be necessary.

TimescaleDB version affected

2.9.1

PostgreSQL version used

14.6

What operating system did you use?

Ubuntu 22.04.1 LTS

What installation method did you use?

Docker

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

The following creates a table metrics with a few rows and then tries to fill gaps in monthly buckets and a time range of two years.

show timezone; -- UTC

CREATE TABLE metrics
(
    time      TIMESTAMP WITH TIME ZONE,
    device_id INT,
    value     DOUBLE PRECISION
);

INSERT INTO metrics
VALUES ('2022-01-01T00:00:00+01:00', 1, 2),
       ('2023-01-01T00:00:00+01:00', 1, 3),
       ('2023-03-01T00:00:00+01:00', 1, 3.3);


SELECT time_bucket_gapfill(INTERVAL '1 mons', time, timezone => 'Europe/Berlin') AS "month",
       avg(value)       as avg,
       locf(avg(value)) AS "locf_avg"
FROM metrics rd
WHERE device_id = 1 AND time >= '2022-01-01T00:00:00+01:00' AND time < '2024-01-01T00:00:00+01:00'
GROUP BY month
ORDER BY month
LIMIT 1001;

The result is the following:

month avg locf_avg
2021-12-31 23:00:00.000000 +00:00 2 2
2022-01-31 23:00:00.000000 +00:00 null 2
2022-02-28 23:00:00.000000 +00:00 null 2
2022-03-28 23:00:00.000000 +00:00 null 2
2022-04-28 23:00:00.000000 +00:00 null 2
2022-05-28 23:00:00.000000 +00:00 null 2
2022-06-28 23:00:00.000000 +00:00 null 2
2022-07-28 23:00:00.000000 +00:00 null 2
2022-08-28 23:00:00.000000 +00:00 null 2
2022-09-28 23:00:00.000000 +00:00 null 2
2022-10-28 23:00:00.000000 +00:00 null 2
2022-11-28 23:00:00.000000 +00:00 null 2
2022-12-28 23:00:00.000000 +00:00 null 2
2023-01-28 23:00:00.000000 +00:00 null 2
2023-02-28 23:00:00.000000 +00:00 null 2
2023-03-28 23:00:00.000000 +00:00 null 2
2023-04-28 23:00:00.000000 +00:00 null 2
2023-05-28 23:00:00.000000 +00:00 null 2
2023-06-28 23:00:00.000000 +00:00 null 2
2023-07-28 23:00:00.000000 +00:00 null 2
2023-08-28 23:00:00.000000 +00:00 null 2
2023-09-28 23:00:00.000000 +00:00 null 2
2023-10-28 23:00:00.000000 +00:00 null 2
2023-11-28 23:00:00.000000 +00:00 null 2
2023-12-28 23:00:00.000000 +00:00 null 2
2022-12-31 23:00:00.000000 +00:00 3 3
2023-02-28 23:00:00.000000 +00:00 3.3 3.3

As you can see, the results are wrong:

  1. After 2022-02-28, the days and times are wrong (the 28th is not the correct last day for months other than February and the offset should be +02:00 for summer time)
  2. All gaps are filled with the first value, 2
  3. At the end of the result list, the other two values (3 and 3.3) are just added raw

Now, if I change the DB's time zone to German time via set timezone TO 'Europe/Berlin', the same query returns the correct results:

month avg locf_avg
2022-01-01 00:00:00.000000 +01:00 2 2
2022-02-01 00:00:00.000000 +01:00 null 2
2022-03-01 00:00:00.000000 +01:00 null 2
2022-04-01 00:00:00.000000 +02:00 null 2
2022-05-01 00:00:00.000000 +02:00 null 2
2022-06-01 00:00:00.000000 +02:00 null 2
2022-07-01 00:00:00.000000 +02:00 null 2
2022-08-01 00:00:00.000000 +02:00 null 2
2022-09-01 00:00:00.000000 +02:00 null 2
2022-10-01 00:00:00.000000 +02:00 null 2
2022-11-01 00:00:00.000000 +01:00 null 2
2022-12-01 00:00:00.000000 +01:00 null 2
2023-01-01 00:00:00.000000 +01:00 3 3
2023-02-01 00:00:00.000000 +01:00 null 3
2023-03-01 00:00:00.000000 +01:00 3.3 3.3
2023-04-01 00:00:00.000000 +02:00 null 3.3
2023-05-01 00:00:00.000000 +02:00 null 3.3
2023-06-01 00:00:00.000000 +02:00 null 3.3
2023-07-01 00:00:00.000000 +02:00 null 3.3
2023-08-01 00:00:00.000000 +02:00 null 3.3
2023-09-01 00:00:00.000000 +02:00 null 3.3
2023-10-01 00:00:00.000000 +02:00 null 3.3
2023-11-01 00:00:00.000000 +01:00 null 3.3
2023-12-01 00:00:00.000000 +01:00 null 3.3

Thank you @maxhertrampf for reaching out!

  1. After 2022-02-28, the days and times are wrong (the 28th is not the correct last day for months other than February and the offset should be +02:00 for summer time)
    this looks pretty bad - it also affects time_bucket; and unfortunately its also there on the current main branch. Thank you for reporting it!
  2. All gaps are filled with the first value, 2
    yes; they are - I've just read the locf docs and it mentions:
The locf (last observation carried forward) function allows you to carry the last seen value in an aggregation group forward

I think with this description if the last observation was n rows ago - that should still count as a last observation ; so I guess its ok.

Do you want to get the last value only iff the last was not null? You could try something like this (this involves windowing for showcase):

set timezone='Europe/Berlin';

--explain
SELECT time_bucket_gapfill(INTERVAL '1 mons', time,timezone => 'Europe/Berlin') AS "month",
       avg(value)       as avg,
       locf(avg(value)),
       coalesce(avg(value),lag(avg(value)) over ( order by time_bucket_gapfill(INTERVAL '1 mons', time,timezone => 'Europe/Berlin'))),
        1
FROM metrics rd
WHERE device_id = 1 AND time >= '2022-01-01T00:00:00+01:00' AND time < '2024-01-01T00:00:00+01:00'
GROUP BY month
ORDER BY month
LIMIT 1001;
  1. At the end of the result list, the other two values (3 and 3.3) are just added raw

I think this be the sideeffect of the 1. issue.

I'll put this issue in the backlog ; we'll get back to it!

@kgyrtkirk , thank you for analysing the problem.

  1. Thanks for confirming it. Just as a side note, from our point of view, the problem in time_bucket is not as severe

For example, the following SQL ...

set timezone = 'UTC';

SELECT time_bucket(INTERVAL '1 mons', time, timezone => 'Europe/Berlin') AS "month",
       avg(value)       as avg
FROM metrics rd
WHERE device_id = 1 AND time >= '2022-01-01T00:00:00+01:00' AND time < '2024-01-01T00:00:00+01:00'
GROUP BY month
ORDER BY month
LIMIT 1001;

... correctly returns:

month avg
2021-12-31 23:00:00.000000 +00:00 2
2022-12-31 23:00:00.000000 +00:00 3
2023-02-28 23:00:00.000000 +00:00 3.3

The only "problem" here is that the timestamps are formatted in the UTC time zone, but they are correct. For us, that is not a real problem.

With set timezone = 'Europe/Berlin, the same timestamps are returned, just formatted in Europe/Berlin:

month avg
2022-01-01 00:00:00.000000 +01:00 2
2023-01-01 00:00:00.000000 +01:00 3
2023-03-01 00:00:00.000000 +01:00 3.3
  1. I have to disagree with you here. 2 is not the last seen value for all these timestamps.
    Please compare the wrong result with the correct result.

formatted in the UTC time zone

oh - you are right!

  1. [...] Please compare the wrong result with the correct result.

I was not comparing them that closely but I think we should propably re-check 2. again after 1. will be fixed - because it seems like gapfill have failed to pickup the 2023-02-28 around the right time - the resultset is not even in ORDER BY month order!

Hi, I have the same or similar issue (for my description see timescale/timescaledb-toolkit#746) and was wondering if there is an estimate for when this could be resolved?

Hi @antekresic! I think your PR #6155 fixes this issue as well.

Yes, that's correct. This has been fixed in the recent 2.12.2 release.

I'm closing this ticket as fixed, feel free to reopen it if you have any other concerns.

This has been fixed in the recent 2.12.2 release. I'm closing this ticket as fixed, feel free to reopen it if you have any other concerns.

@antekresic + @lkshminarayanan I don't believe this is fixed. I'm currently on timescale 2.12.2 and postgres 14.9, and using the following SQL and time_bucket_gapfill, I see date "duplication" for dates following daylight savings time:

SHOW timezone; -- UTC

CREATE TABLE metrics
(
    time      TIMESTAMP WITH TIME ZONE,
    value     DOUBLE PRECISION
);

INSERT INTO metrics
VALUES ('2023-11-05T00:00:00+01:00', 2),
       ('2023-11-06T00:00:00+01:00', 3),
       ('2023-11-07T00:00:00+01:00', 3),
       ('2023-11-08T00:00:00+01:00', 3),
       ('2023-11-09T00:00:00+01:00', 3.3);


SELECT avg(m.value), time_bucket_gapfill('1 day',m.time,'America/Denver') as day_bucket
FROM metrics m
WHERE m.time > (now()) - interval '29 days'
AND m.time < (now())
GROUP BY day_bucket;

Results (33 rows):

+--------------------+------------------------+
|        avg         |       day_bucket       |
+--------------------+------------------------+
|               NULL | 2023-10-31 06:00:00+00 |
|               NULL | 2023-11-01 06:00:00+00 |
|               NULL | 2023-11-02 06:00:00+00 |
|               NULL | 2023-11-03 06:00:00+00 |
|                  2 | 2023-11-04 06:00:00+00 |
|                  3 | 2023-11-05 06:00:00+00 |
|               NULL | 2023-11-06 06:00:00+00 |
|               NULL | 2023-11-07 06:00:00+00 |
|               NULL | 2023-11-08 06:00:00+00 |
|               NULL | 2023-11-09 06:00:00+00 |
|               NULL | 2023-11-10 06:00:00+00 |
|               NULL | 2023-11-11 06:00:00+00 |
|               NULL | 2023-11-12 06:00:00+00 |
|               NULL | 2023-11-13 06:00:00+00 |
|               NULL | 2023-11-14 06:00:00+00 |
|               NULL | 2023-11-15 06:00:00+00 |
|               NULL | 2023-11-16 06:00:00+00 |
|               NULL | 2023-11-17 06:00:00+00 |
|               NULL | 2023-11-18 06:00:00+00 |
|               NULL | 2023-11-19 06:00:00+00 |
|               NULL | 2023-11-20 06:00:00+00 |
|               NULL | 2023-11-21 06:00:00+00 |
|               NULL | 2023-11-22 06:00:00+00 |
|               NULL | 2023-11-23 06:00:00+00 |
|               NULL | 2023-11-24 06:00:00+00 |
|               NULL | 2023-11-25 06:00:00+00 |
|               NULL | 2023-11-26 06:00:00+00 |
|               NULL | 2023-11-27 06:00:00+00 |
|               NULL | 2023-11-28 06:00:00+00 |
|               NULL | 2023-11-29 06:00:00+00 |
|                  3 | 2023-11-06 07:00:00+00 |
|                  3 | 2023-11-07 07:00:00+00 |
| 3.2999999999999994 | 2023-11-08 07:00:00+00 |
+--------------------+------------------------+
(33 rows)

And I, too, can set the local timezone and get the correct results:

begin;
  SET LOCAL time zone 'America/Denver';
  SELECT avg(m.value), time_bucket_gapfill('1 day',m.time,'America/Denver') as day_bucket
  FROM metrics m
  WHERE m.time > (now()) - interval '29 days'
  AND m.time < (now())
  GROUP BY day_bucket;
commit;

Results (30 rows)

+--------------------+------------------------+
|        avg         |       day_bucket       |
+--------------------+------------------------+
|               NULL | 2023-10-31 00:00:00-06 |
|               NULL | 2023-11-01 00:00:00-06 |
|               NULL | 2023-11-02 00:00:00-06 |
|               NULL | 2023-11-03 00:00:00-06 |
|                  2 | 2023-11-04 00:00:00-06 |
|                  3 | 2023-11-05 00:00:00-06 |
|                  3 | 2023-11-06 00:00:00-07 |
|                  3 | 2023-11-07 00:00:00-07 |
| 3.2999999999999994 | 2023-11-08 00:00:00-07 |
|               NULL | 2023-11-09 00:00:00-07 |
|               NULL | 2023-11-10 00:00:00-07 |
|               NULL | 2023-11-11 00:00:00-07 |
|               NULL | 2023-11-12 00:00:00-07 |
|               NULL | 2023-11-13 00:00:00-07 |
|               NULL | 2023-11-14 00:00:00-07 |
|               NULL | 2023-11-15 00:00:00-07 |
|               NULL | 2023-11-16 00:00:00-07 |
|               NULL | 2023-11-17 00:00:00-07 |
|               NULL | 2023-11-18 00:00:00-07 |
|               NULL | 2023-11-19 00:00:00-07 |
|               NULL | 2023-11-20 00:00:00-07 |
|               NULL | 2023-11-21 00:00:00-07 |
|               NULL | 2023-11-22 00:00:00-07 |
|               NULL | 2023-11-23 00:00:00-07 |
|               NULL | 2023-11-24 00:00:00-07 |
|               NULL | 2023-11-25 00:00:00-07 |
|               NULL | 2023-11-26 00:00:00-07 |
|               NULL | 2023-11-27 00:00:00-07 |
|               NULL | 2023-11-28 00:00:00-07 |
|               NULL | 2023-11-29 00:00:00-07 |
+--------------------+------------------------+
(30 rows)

Let me know if you'd like me to move this to a new issue, or keep the discussion here. Thanks!