dbt-labs/snowplow

Add Spark support

jtcohen6 opened this issue · 1 comment

I could have a WIP PR for this very quickly. The changes to the default implementations are actually very simple:

  • Use `cast(column as datatype)` instead of `column::datatype`
  • Implement `spark__convert_timezone` and `spark__dateadd`
  • Call `snowplow.dateadd`, which then calls `dbt_utils.dateadd` or `spark__dateadd` as appropriate (see the sketch after this list)
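A minimal sketch of what that could look like, assuming a plain `target.type` check for dispatch (the exact macro bodies and signatures here are illustrative, not pulled from the package):

```sql
{# Sketch only: a dateadd wrapper that defers to dbt_utils everywhere
   except Spark, where it calls a Spark-specific implementation. #}

{% macro dateadd(datepart, interval, from_date_or_timestamp) %}
    {%- if target.type == 'spark' -%}
        {{ spark__dateadd(datepart, interval, from_date_or_timestamp) }}
    {%- else -%}
        {{ dbt_utils.dateadd(datepart, interval, from_date_or_timestamp) }}
    {%- endif -%}
{% endmacro %}

{% macro spark__dateadd(datepart, interval, from_date_or_timestamp) %}
    {# Spark's date_add() only handles days; use interval arithmetic otherwise.
       Assumes `interval` is a literal, which is all this package needs. #}
    {%- if datepart == 'day' -%}
        date_add({{ from_date_or_timestamp }}, {{ interval }})
    {%- else -%}
        ({{ from_date_or_timestamp }} + interval {{ interval }} {{ datepart }})
    {%- endif -%}
{% endmacro %}

{% macro spark__convert_timezone(in_tz, out_tz, in_timestamp) %}
    {# Normalize to UTC, then shift into the target timezone. #}
    from_utc_timestamp(to_utc_timestamp({{ in_timestamp }}, {{ in_tz }}), {{ out_tz }})
{% endmacro %}
```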

Thoughts:

  • Should Snowplow encode the logic for `spark__dateadd`, or pull it from (e.g.) spark-utils? I wouldn't want to impose that dependency in the 99% of cases where people aren't running this package on Spark, though. This will all be easier once we reimplement adapter macros.
  • It's trivial to implement this for Delta, which supports merging on a unique key, but what about standard Spark (`insert_overwrite`) incrementals? Do we need to add a truncated date column to these models to pass to `partition_by`? I don't think we can partition on a column expression. (A rough sketch of that workaround follows this list.)
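For the `insert_overwrite` case, the workaround would look roughly like this: add a truncated date column in the model itself and name that column in `partition_by`, since `partition_by` takes column names rather than expressions. Model and column names below are illustrative, not the package's actual ones:

```sql
{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'insert_overwrite',
        partition_by = ['event_date'],
        file_format = 'parquet'
    )
}}

select
    ev.*,
    -- truncated date column, added only so Spark has a plain column to partition on
    cast(ev.collector_tstamp as date) as event_date
from {{ ref('snowplow_base_events') }} as ev
{% if is_incremental() %}
where ev.collector_tstamp > (select max(collector_tstamp) from {{ this }})
{% endif %}
```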

Closing; addressed via spark_utils.