Add Spark support
jtcohen6 opened this issue · 1 comment
jtcohen6 commented
I could have a WIP PR for this very quickly. The changes to the default implementations are actually very simple:
- Use `cast(column as datatype)` instead of `column::datatype`
- Implement `spark__convert_timezone` and `spark__dateadd`
- Call `snowplow.dateadd`, which then calls `dbt_utils.dateadd` or `spark__dateadd` as appropriate
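
For concreteness, here's a minimal sketch of what that dispatch could look like. This isn't the actual implementation: the `target.type` check stands in for whatever adapter-macro mechanism we end up with, and the Spark interval syntax is just one way to do it.

```sql
{# Sketch only: dispatch between the default and Spark implementations.
   The target.type check is a placeholder for a proper adapter macro. #}
{% macro dateadd(datepart, interval, from_date_or_timestamp) %}
    {% if target.type == 'spark' %}
        {{ spark__dateadd(datepart, interval, from_date_or_timestamp) }}
    {% else %}
        {{ dbt_utils.dateadd(datepart, interval, from_date_or_timestamp) }}
    {% endif %}
{% endmacro %}

{# Sketch only: Spark SQL supports interval arithmetic on timestamps,
   so a simple implementation can render the interval inline. #}
{% macro spark__dateadd(datepart, interval, from_date_or_timestamp) %}
    {{ from_date_or_timestamp }} + interval {{ interval }} {{ datepart }}
{% endmacro %}
```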
Thoughts:
- Should Snowplow encode the logic for `spark__dateadd`, or pull it from (e.g.) spark-utils? I wouldn't want that dependency, however, in the 99% of cases when people aren't running this package on Spark. This will all be easier once we reimplement adapter macros.
- It's trivial to implement this for Delta, which supports merging on a unique key, but what about standard Spark (`insert_overwrite`) incrementals? Do we need to add a truncated date column to these models, to pass to `partition_by`? I don't think we can partition on a column expression.
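
If we go the truncated-date-column route, it might look something like the sketch below. Model and column choices here are hypothetical; the point is only that `partition_by` gets a real column rather than an expression.

```sql
-- Hypothetical model sketch: materialize a truncated date column so that
-- insert_overwrite incrementals have a concrete column to partition by.
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        file_format='parquet',
        partition_by=['event_date']
    )
}}

select
    ev.*,
    cast(ev.collector_tstamp as date) as event_date
from {{ ref('snowplow_base_events') }} as ev
```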
jtcohen6 commented
Closing, addressed via spark_utils