duneanalytics/harmonizer

datediff -> DATE_DIFF

andrewhong5297 opened this issue · 6 comments

Seeing an issue where for some reason translation of date diff isn't quite correct. Giving the trino error: USER_ERROR line 6:76: Column 'day' cannot be resolved

Postgres/Spark input:

WITH date_range AS ( SELECT date(min(block_time)) + interval '1' day * seq AS date FROM ethereum.logs, UNNEST(sequence(0, datediff(day, min(block_time), max(block_time)))) AS t(seq) WHERE contract_address = 0x81f8f0bb1cB2A06649E51913A151F0E7Ef6FA321 ), daily_transactions AS ( SELECT date(block_time) AS date, count(*) AS num_transactions FROM ethereum.logs WHERE contract_address = 0x81f8f0bb1cB2A06649E51913A151F0E7Ef6FA321 GROUP BY date(block_time) ) SELECT dr.date, COALESCE(dt.num_transactions, 0) AS num_transactions FROM date_range dr LEFT JOIN daily_transactions dt ON dr.date = dt.date ORDER BY dr.date;

Trino:

/* Success! If you're still running into issues, check out https://dune.com/docs/query/syntax-differences/ or reach out in the #dune-sql Discord channel. */

WITH date_range AS (  SELECT  DATE(MIN(block_time)) + INTERVAL '1' day * seq AS date  FROM ethereum.logs, UNNEST(SEQUENCE(0, DATE_DIFF('day', MIN(block_time), day)) /* WARNING: Check out the docs for example of time series generation: https://dune.com/docs/query/syntax-differences/ */) AS t(seq)  WHERE  contract_address = 0x81f8f0bb1cb2a06649e51913a151f0e7ef6fa321 ), daily_transactions AS (  SELECT  DATE(block_time) AS date,  COUNT(*) AS num_transactions  FROM ethereum.logs  WHERE  contract_address = 0x81f8f0bb1cb2a06649e51913a151f0e7ef6fa321  GROUP BY  DATE(block_time) ) SELECT  dr.date,  COALESCE(dt.num_transactions, 0) AS num_transactions FROM date_range AS dr LEFT JOIN daily_transactions AS dt  ON dr.date = dt.date ORDER BY  dr.date

Great, thanks! It seems like it's swapping the order of the arguments and not quoting them correctly:

>>> sqlglot.transpile("select datediff('unit', from_, to)", read="postgres", write="trino")
["SELECT DATE_DIFF('to', from_, 'unit')"]

Hey 👋

I don't think that Postgres SQL supports the DATEDIFF function. Maybe this is a UDF? Also, Spark's documentation indicates that its DATEDIFF function accepts two arguments.

Are these queries valid, i.e. can they run on the corresponding engines? Happy to help if this is a SQLGlot issue!

Interestingly, spark-sql seems to actually support the unit argument:

spark-sql> SELECT DATEDIFF(day, '2020-01-01', '2020-01-05');
timestampdiff(day, 2020-01-01, 2020-01-05)
4
Time taken: 0.206 seconds, Fetched 1 row(s)
spark-sql> SELECT DATEDIFF(day, '2020-01-01', '2020-02-05');
timestampdiff(day, 2020-01-01, 2020-02-05)
35
Time taken: 0.168 seconds, Fetched 1 row(s)
spark-sql> SELECT DATEDIFF(month, '2020-01-01', '2020-02-05');
timestampdiff(month, 2020-01-01, 2020-02-05)
1
Time taken: 0.163 seconds, Fetched 1 row(s)

So this might be a documentation issue.. Verified that we get back a "function does not exist" error for postgres, though.

Ahh, it very well might be that it's a UDF! We have a fair amount of UDFs on the Postgres clusters. Will check. Thanks!

Confirmed that we don't actually have this in Postgres, so this would only be for Spark.

Interesting that it doesn't exist in the documentation!

Cool, since it runs on spark-sql & it's also supported by databricks, I think I'll just add it in Hive so that they're all covered. Thanks for checking!