dbt-labs/snowplow

Snowflake web page context is unnecessarily separated and is not incremental

davehowell opened this issue · 3 comments

Describe the feature

The default models assume that the web_page_context data will be landed in a separate table from the raw events. On Snowflake that is not the case; the data is inside a variant field on the events table. Could some models for Snowflake be skipped, or made unnecessary, by flattening that field out of the source events, something like CONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_WEB_PAGE_1[0]['id'] as PAGE_VIEW_ID? Has that been considered?
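
For the sake of discussion, a minimal sketch of what that flattening could look like on Snowflake. Referencing the events relation via the 'snowplow:events' var is purely illustrative here (that wiring is my assumption, not how the package currently builds these models):

select
    event_id,
    collector_tstamp,
    -- pull the page view id straight out of the variant column on the events table
    contexts_com_snowplowanalytics_snowplow_web_page_1[0]['id']::varchar as page_view_id
from {{ var('snowplow:events') }}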

The snowplow_web_page_context model contains this comment:

-- This one is a little tougher to make incremental
-- because there's no timestamp field here. We could
-- relocate the event collector_tstamp (by root_id)
-- onto snowplow_base_web_page_context, but that would
-- likely mitigate any performance gains!

I have found it necessary anyway to dedupe the data in my raw events table, because if I don't, the snowplow_sessions_tmp model fails on merge.

Describe alternatives you've considered

To achieve that deduping in my raw events, this was recommended (by helpful people on the Slack channel):

QUALIFY
  row_number() over (partition by EVENT_ID order by COLLECTOR_TSTAMP, DERIVED_TSTAMP) = 1;

That works, but it is incredibly slow, so I have had to implement it as an incremental model using collector_tstamp as a watermark.
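
Roughly, the shape of such an incremental dedupe model (the source name is hypothetical, and the columns follow the snippet above; this is a sketch, not the exact model I run):

{{
    config(
        materialized='incremental',
        unique_key='event_id'
    )
}}

select *
from {{ source('atomic', 'events') }}  -- hypothetical source name

{% if is_incremental() %}
-- only scan events newer than the current high-water mark
where collector_tstamp > (select max(collector_tstamp) from {{ this }})
{% endif %}

qualify
  row_number() over (partition by event_id order by collector_tstamp, derived_tstamp) = 1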

My concern is that the web_page_context model is separated from the events and is not incremental, only to be recombined later into a number of other models.

If you think there is merit to this, I can look at putting in Snowflake-specific changes to optimise the model flow for that platform. Otherwise, if this has already been considered and ruled out, could you let me know why it might be a bad idea?

Additional context

This is Snowflake specific.

Who will this benefit?

Those with huge amounts of ever-growing events from Snowplow pipelines.

Sorry for the serious delay in responding here, @davehowell. This is something I've wanted to do for some time (#52). The potential time (+ cost) savings on Snowflake are immense. I think the required pieces are extensive enough that it could (+ should) be part of a larger refactor.

Unfortunately I have moved on from the previous project and no longer have access to Snowplow data to look at this again, but I recall that I created a few override snowplow models in the main dbt project to bypass some of this, as well as adding those flattened-out fields to some of the special "include" fields that are available as vars.
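
For anyone landing here, a very rough sketch of what such an override can look like. I am writing this from memory, so the var names ('snowplow:events', 'snowplow:context:web_page') and the output columns (root_id, root_tstamp, id) are assumptions about the package's wiring rather than a tested drop-in:

-- models/staging/snowplow_web_page_context_flattened.sql (hypothetical model name)
-- Builds the web page context directly from the events table on Snowflake,
-- so a separately shredded context table is not needed. The idea is to point
-- the package's 'snowplow:context:web_page' var at this model, if that is how
-- your version wires it.

select
    event_id as root_id,
    collector_tstamp as root_tstamp,
    contexts_com_snowplowanalytics_snowplow_web_page_1[0]['id']::varchar as id
from {{ var('snowplow:events') }}  -- adjust to your own source if this var is not set
where contexts_com_snowplowanalytics_snowplow_web_page_1 is not null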

If you don't think it's too much to bite off at one time, then sure, include it as part of your larger refactor. I have to say the Snowplow models are pretty complicated, a lot to get your head around, and this particular issue is very Snowflake-specific, so it might be easier (and, more importantly, easier for the reviewer!) to do separately. If not, then I guess this can be closed.

I created a few override snowplow models in the main dbt project in order to bypass some of this, as well as adding those flattened out fields into some of the special "include" fields that are available as vars.

Honestly, this is what I've done every time I use Snowflake + Snowplow together. It would obviously be better if the package could "just work" on Snowflake in the way we know to be best.