dbt-labs/snowplow

Add Spark support

jtcohen6 opened this issue · 1 comment

I could have a WIP PR for this very quickly. The changes to the default implementations are actually very simple:

  • Use `cast(column as datatype)` instead of `column::datatype`
  • Implement `spark__convert_timezone` and `spark__dateadd`
  • Call `snowplow.dateadd`, which then calls `dbt_utils.dateadd` or `spark__dateadd` as appropriate (see the sketch after this list)
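A minimal sketch of what that could look like, assuming a plain `target.type` check for dispatch (the exact macro bodies and signatures here are illustrative, not pulled from the package):

```sql
{# Sketch only: a dateadd wrapper that defers to dbt_utils everywhere
   except Spark, where it calls a Spark-specific implementation. #}

{% macro dateadd(datepart, interval, from_date_or_timestamp) %}
    {%- if target.type == 'spark' -%}
        {{ spark__dateadd(datepart, interval, from_date_or_timestamp) }}
    {%- else -%}
        {{ dbt_utils.dateadd(datepart, interval, from_date_or_timestamp) }}
    {%- endif -%}
{% endmacro %}

{% macro spark__dateadd(datepart, interval, from_date_or_timestamp) %}
    {# Spark's date_add() only handles days; use interval arithmetic otherwise.
       Assumes `interval` is a literal, which is all this package needs. #}
    {%- if datepart == 'day' -%}
        date_add({{ from_date_or_timestamp }}, {{ interval }})
    {%- else -%}
        ({{ from_date_or_timestamp }} + interval {{ interval }} {{ datepart }})
    {%- endif -%}
{% endmacro %}

{% macro spark__convert_timezone(in_tz, out_tz, in_timestamp) %}
    {# Normalize to UTC, then shift into the target timezone. #}
    from_utc_timestamp(to_utc_timestamp({{ in_timestamp }}, {{ in_tz }}), {{ out_tz }})
{% endmacro %}
```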

Thoughts:

  • Should Snowplow encode the logic for `spark__dateadd`, or pull it from (e.g.) spark-utils? I wouldn't want to impose that dependency in the 99% of cases where people aren't running this package on Spark, though. This will all be easier once we reimplement adapter macros.
  • It's trivial to implement this for Delta, which supports merging on a unique key, but what about standard Spark (`insert_overwrite`) incrementals? Do we need to add a truncated date column to these models to pass to `partition_by`? I don't think we can partition on a column expression. (A rough sketch of that workaround follows this list.)
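For the `insert_overwrite` case, the workaround would look roughly like this: add a truncated date column in the model itself and name that column in `partition_by`, since `partition_by` takes column names rather than expressions. Model and column names below are illustrative, not the package's actual ones:

```sql
{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'insert_overwrite',
        partition_by = ['event_date'],
        file_format = 'parquet'
    )
}}

select
    ev.*,
    -- truncated date column, added only so Spark has a plain column to partition on
    cast(ev.collector_tstamp as date) as event_date
from {{ ref('snowplow_base_events') }} as ev
{% if is_incremental() %}
where ev.collector_tstamp > (select max(collector_tstamp) from {{ this }})
{% endif %}
```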

Closing; addressed via spark_utils.