linkedin/coral

Bug: `FromUnixtimeOperatorTransformer` possibly built on false supposisions.

findinpath opened this issue · 2 comments

The FromUnixtimeOperatorTransformer from https://github.com/linkedin/coral/pull/426/files#diff-e26e325f65fb1f3341468a92d2e1051ee2b8b94b56c74203054aca3b2c790e4e
seems to be built on false premises.

The function https://trino.io/docs/current/functions/datetime.html#from_unixtime returns a timestamp(3) with time zone

However, the FromUnixtimeOperatorTransformer transforms from_unixtime expressions to
FORMAT_DATETIME(FROM_UNIXTIME(10000), 'yyyy-MM-dd HH:mm:ss') which is a varchar
This is causing the failure of statements like the following:

select 
  CAST(
    CAST(
      "at_timezone" (
        "format_datetime" (
          "from_unixtime" (
            CAST(
              "test_from_utc_timestamp_source"."source_float" AS DOUBLE
            )
          ),
          'yyyy-MM-dd HH:mm:ss'
        ),
        "$canonicalize_hive_timezone_id" ('America/Los_Angeles')
      ) AS TIMESTAMP (3)
    ) AS VARCHAR (65535)
  ) AS "ts_float"
from test_from_utc_timestamp_source 

with the exception:

Caused by: io.trino.spi.TrinoException: Unexpected parameters (varchar, varchar) for function at_timezone. Expected: at_timezone(timestamp(p) with time zone, interval day to second), at_timezone(timestamp(p) with time zone, varchar(x))

Related PR: #426

Steps to reproduce the issue

hive

CREATE TABLE test_from_utc_timestamp_source (source_float float);
CREATE VIEW test_from_utc_timestamp_view AS SELECT CAST(from_utc_timestamp(source_float, 'America/Los_Angeles') AS STRING) ts_float FROM test_from_utc_timestamp_source;

select * from test_from_utc_timestamp_view;

-- 1970-01-30 21:30:00

trino

insert into hive.default.test_from_utc_timestamp_source values (2592000.0);

trino version 420 while using coral version 2.2.9

SELECT * FROM hive.default.test_from_utc_timestamp_view;
Query 20231013_184405_00000_fa5un failed: line 1:15: Failed analyzing stored view 'hive.default.test_from_utc_timestamp_view': line 1:18: Unexpected parameters (varchar, varchar) for function at_timezone. Expected: at_timezone(timestamp(p) with time zone, varchar(x)), at_timezone(timestamp(p) with time zone, interval day to second)

trinoSql in ViewReaderUtil

SELECT CAST(CAST("at_timezone"("format_datetime"("from_unixtime"(CAST("test_from_utc_timestamp_source"."source_float" AS DOUBLE)), 'yyyy-MM-dd HH:mm:ss'), "$canonicalize_hive_timezone_id"('America/Los_Angeles')) AS TIMESTAMP(3)) AS VARCHAR(65535)) AS "ts_float"
FROM "default"."test_from_utc_timestamp_source" AS "test_from_utc_timestamp_source"

trino version 420 while using coral version 2.1.5

trino> SELECT * FROM hive.default.test_from_utc_timestamp_view;
        ts_float         
-------------------------
 1970-01-30 16:00:00.000 

trinoSql in ViewReaderUtil

SELECT CAST(CAST("at_timezone"("from_unixtime"(CAST("test_from_utc_timestamp_source"."source_float" AS DOUBLE)), "$canonicalize_hive_timezone_id"('America/Los_Angeles')) AS TIMESTAMP(3)) AS VARCHAR(65535)) AS "ts_float"
FROM "default"."test_from_utc_timestamp_source" AS "test_from_utc_timestamp_source"

https://github.com/trinodb/trino/blob/fa01166036c540a498af2beeedcf01fdc6d9e7c0/plugin/trino-hive/src/main/java/io/trino/plugin/hive/ViewReaderUtil.java#L246

2.1.5

CAST(CAST("at_timezone"("from_unixtime"...

2.2.9

CAST(CAST("at_timezone"("format_datetime"("from_unixtime"

format_datetime shouldn't be there.

FromUtcTimestampOperatorTransformer is playing with both:

  • from_unixtime_nanos
  • from_unixtime

from_unixtime_nanos is a Trino specific function https://trino.io/docs/current/functions/datetime.html?highlight=from_unixtime_nanos#from_unixtime_nanos

Translation of from_unixtime_nanos works as expected.

from_unixtime is:

from_unixtime(bigint unixtime[, string pattern])

Converts a number of seconds since epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current time zone(using config "hive.local.time.zone") using the specified pattern. If the pattern is missing the default is used ('uuuu-MM-dd HH:mm:ss' or yyyy-MM-dd HH:mm:ss'). Example: from_unixtime(0)=1970-01-01 00:00:00 (hive.local.time.zone=Etc/GMT)

Given the fact that from_unixtime is a Hive function, it gets its own FromUnixtimeOperatorTransformer and that's where our problem actually occurs.
When we transform from_utc_timestamp via FromUtcTimestampOperatorTransformer we're creating (in case of dealing with floats) a nested from_unixtime call which gets unintentionally transformed as well to "at_timezone"("from_unixtime" .