datafold/data-diff

Snowflake and Postgres dialects normalize dates differently than other database types

RenanBasilio opened this issue · 1 comments

Describe the bug
In the Snowflake and Postgres database dialects date columns are normalized differently than in other dialects. This causes a false positive diff when comparing those dates with date columns in other database types.

For example, consider the comparison between MySQL and Snowflake:

This is what the schemas look like (from logging with --verbose):

INFO     [MySQL] Schema = {
    'id': Integer(_notes=[], precision=0, python_type=<class 'int'>),
    'some_date': Date(_notes=[], precision=6, rounds=True)}
INFO     [Snowflake] Schema = {
    'ID': Decimal(_notes=[], precision=0),
    'SOME_DATE': Date(_notes=[], precision=6, rounds=False)}

This query is used when pulling a date column from the MySQL database:

SELECT
    cast(`id` as char),
    cast(cast( cast(`some_date` as datetime(6)) as datetime(6)) as char)
FROM `some_schema`.`some_table`

This returns a pair like

1, 2024-03-28 00:00:00.000000

In turn, in Snowflake the query is

SELECT
    cast(cast("ID" as decimal(38, 0)) as string),
    ("SOME_DATE"::varchar)
FROM "SOME_SCHEMA"."SOME_TABLE"

which returns a row like

1, 2024-03-28

This is then detected as a mismatch.

This seems to stem from the normalize_timestamp function in the database dialect. In the BaseDialect, it is documented that the function should return a date formatted as YYYY-MM-DD HH:mm:SS.FFFFFF, which is accurate to the MySQL result but not to the Snowflake or Postgres dialects which only cast it directly as a string.

def normalize_timestamp(self, value: str, coltype: TemporalType) -> str:
if coltype.rounds:
return self.to_string(f"cast( cast({value} as datetime({coltype.precision})) as datetime(6))")
s = self.to_string(f"cast({value} as datetime(6))")
return f"RPAD(RPAD({s}, {TIMESTAMP_PRECISION_POS+coltype.precision}, '.'), {TIMESTAMP_PRECISION_POS+6}, '0')"

def normalize_timestamp(self, value: str, coltype: TemporalType) -> str:
def _add_padding(coltype: TemporalType, timestamp6: str):
return f"RPAD(LEFT({timestamp6}, {TIMESTAMP_PRECISION_POS+coltype.precision}), {TIMESTAMP_PRECISION_POS+6}, '0')"
try:
is_date = coltype.is_date
is_time = coltype.is_time
except:
is_date = False
is_time = False
if isinstance(coltype, Date) or is_date:
return f"cast({value} as varchar)"

Describe the environment

Data Diff Open Source v0.11.1
Python 3.9.12
MacOSX Sonoma 14.2.1 (23C71)

Hi @RenanBasilio,

Thank you for trying out data-diff and for taking the time to open this issue. We made a hard decision to sunset the data-diff package and won't provide further development or support. Diffing functionality will continue to be available in Datafold Cloud. We have completely rewritten the diffing engine in the cloud over the past few months and have solved the fundamental issues with the original algorithm used in the data-diff package. Feel free to take it for a trial or contact us at support@datafold.com if you have any questions.

-Gleb