fivetran/dbt_ad_reporting

[Bug] Incorrect surrogate key combination used in search report

clay-walker opened this issue · 4 comments

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

The Microsoft ads search report model uses the following columns as a surrogate key:

        - date_day
        - account_id
        - campaign_id
        - ad_group_id
        - ad_id
        - keyword_id
        - search_query
        - device_os
        - device_type
        - network

This combination of columns does not uniquely identify a record, and the test fails. The cause of this is:

  1. match_type (a concatenation of two fields) is used in the select from the base table, but not included in the surrogate key. The two fields that are concatenated into match_type are actually part of the key for the base table (bingads.keyword_performance_daily_report). This is a screenshot of Fivetran as evidence:
    image

  2. Additionally, keyword_id is part of the composite key, but that field is left joined from the keyword history table. Since it's left joined, and keywords are apparently hard deleted from the history table (wtf Microsoft!), the resulting value will be null, which is no bueno for a field that is used in a surrogate key.

Proposed solution:

  1. add match_type to the unique test in microsoft_ads.yml
  2. Inner join to keyword to remove the possibility of null keyword_id values in the surrogate key

Relevant error log or model output

Failure in test dbt_utils_unique_combination_of_columns_microsoft_ads__search_report_date_day__account_id__campaign_id__ad_group_id__ad_id__keyword_id__search_query__device_os__device_type__network (models/microsoft_ads.yml)

Expected behavior

Passing tests

dbt Project configurations

vars:
prod_database: analytics
prod_schema: business
operdb_schema_pattern: 'operdb%'
raw_data_db: GONG
"dbt_date:time_zone": "America/Los_Angeles"
timezone_constant: "America/Los_Angeles"

ad_reporting__apple_search_ads_enabled: False # by default this is assumed to be True
ad_reporting__pinterest_ads_enabled: False # by default this is assumed to be True
ad_reporting__twitter_ads_enabled: False # by default this is assumed to be True
ad_reporting__facebook_ads_enabled: False # by default this is assumed to be True
ad_reporting__snapchat_ads_enabled: False # by default this is assumed to be True
ad_reporting__tiktok_ads_enabled: False # by default this is assumed to be True

linkedin_ads_schema: linkedin_ads
linkedin_ads_database: source_raw

google_ads_schema: google_ads
google_ads_database: source_raw

microsoft_ads_schema: bingads
microsoft_ads_database: source_raw

Package versions

packages:

  • package: fivetran/salesforce_formula_utils
    version: 0.6.4

  • package: calogica/dbt_date
    version: 0.5.7

  • package: fivetran/ad_reporting
    version: [">=1.0.0", "<1.1.0"]

What database are you using dbt with?

snowflake

dbt Version

1.0.1

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will need assistance and will schedule time during our office hours for guidance
  • No.

Hi @clay-walker thanks for opening this issue. Someone from my team will follow up here in the next few days to discuss the issue further and determine a possible solution.

hey @clay-walker i definitely agree that match_type should be included in the surrogate key!

as for your second point, does microsoft hard-delete keyword_id from the daily search report? i am wondering if changing this line to reference report.keyword_id (instead of keyword_history) would result in no nulls. i must admit i'm a little apprehensive to apply an inner join 😨

also just curious - do you know if microsoft hard deletes IDs from other history tables (ie campaign or ad_group?)

looks like someone else is seeing this issue with the microsoft ad_group report as well - fivetran/dbt_microsoft_ads#18. and we've just confirmed that keyword_id or other ids are not hard-deleted from the report tables.

so my next question would be, what's your opinion on referencing the report.keyword_id instead, so that if a keyword has been hard-deleted, you'd still have its ID but not any other info about it from the keyword_history table? would that be valuable or confusing/unnecessary to have, as opposed to filtering them out with an inner join? i presume that some people would at least want to keep them 🤔

i am thinking that the path forward will be

  1. adding match_type to the surrogate key
  2. auditing the whole microsoft package for any instances in which we selecting an id (such as keyword_id) in the final CTE of a model, but from the right side of the join. we should change this to reference the report on the left side of the join

Hi @clay-walker the latest version of the package (v1.0.4) should address this issue you identified within this bug report. Feel free to upgrade to the latest version and the initial issue should be addressed.

For the time being I will close this issue as the fix has been rolled out. Please feel free to reopen or create a new issue if you experience any other questions when using the package.