[Bug] Incorrect surrogate key combination used in search report
clay-walker opened this issue · 4 comments
Is there an existing issue for this?
- I have searched the existing issues
Describe the issue
The Microsoft ads search report model uses the following columns as a surrogate key:
- date_day
- account_id
- campaign_id
- ad_group_id
- ad_id
- keyword_id
- search_query
- device_os
- device_type
- network
This combination of columns does not uniquely identify a record, and the test fails. The cause of this is:
-
match_type (a concatenation of two fields) is used in the select from the base table, but not included in the surrogate key. The two fields that are concatenated into match_type are actually part of the key for the base table (bingads.keyword_performance_daily_report). This is a screenshot of Fivetran as evidence:
-
Additionally, keyword_id is part of the composite key, but that field is left joined from the keyword history table. Since it's left joined, and keywords are apparently hard deleted from the history table (wtf Microsoft!), the resulting value will be
null
, which is no bueno for a field that is used in a surrogate key.
Proposed solution:
- add match_type to the unique test in microsoft_ads.yml
- Inner join to keyword to remove the possibility of null keyword_id values in the surrogate key
Relevant error log or model output
Failure in test dbt_utils_unique_combination_of_columns_microsoft_ads__search_report_date_day__account_id__campaign_id__ad_group_id__ad_id__keyword_id__search_query__device_os__device_type__network (models/microsoft_ads.yml)
Expected behavior
Passing tests
dbt Project configurations
vars:
prod_database: analytics
prod_schema: business
operdb_schema_pattern: 'operdb%'
raw_data_db: GONG
"dbt_date:time_zone": "America/Los_Angeles"
timezone_constant: "America/Los_Angeles"
ad_reporting__apple_search_ads_enabled: False # by default this is assumed to be True
ad_reporting__pinterest_ads_enabled: False # by default this is assumed to be True
ad_reporting__twitter_ads_enabled: False # by default this is assumed to be True
ad_reporting__facebook_ads_enabled: False # by default this is assumed to be True
ad_reporting__snapchat_ads_enabled: False # by default this is assumed to be True
ad_reporting__tiktok_ads_enabled: False # by default this is assumed to be True
linkedin_ads_schema: linkedin_ads
linkedin_ads_database: source_raw
google_ads_schema: google_ads
google_ads_database: source_raw
microsoft_ads_schema: bingads
microsoft_ads_database: source_raw
Package versions
packages:
-
package: fivetran/salesforce_formula_utils
version: 0.6.4 -
package: calogica/dbt_date
version: 0.5.7 -
package: fivetran/ad_reporting
version: [">=1.0.0", "<1.1.0"]
What database are you using dbt with?
snowflake
dbt Version
1.0.1
Additional Context
No response
Are you willing to open a PR to help address this issue?
- Yes.
- Yes, but I will need assistance and will schedule time during our office hours for guidance
- No.
Hi @clay-walker thanks for opening this issue. Someone from my team will follow up here in the next few days to discuss the issue further and determine a possible solution.
hey @clay-walker i definitely agree that match_type
should be included in the surrogate key!
as for your second point, does microsoft hard-delete keyword_id
from the daily search report? i am wondering if changing this line to reference report.keyword_id
(instead of keyword_history) would result in no nulls. i must admit i'm a little apprehensive to apply an inner join 😨
also just curious - do you know if microsoft hard deletes IDs from other history tables (ie campaign or ad_group?)
looks like someone else is seeing this issue with the microsoft ad_group report as well - fivetran/dbt_microsoft_ads#18. and we've just confirmed that keyword_id
or other ids are not hard-deleted from the report tables.
so my next question would be, what's your opinion on referencing the report.keyword_id
instead, so that if a keyword has been hard-deleted, you'd still have its ID but not any other info about it from the keyword_history
table? would that be valuable or confusing/unnecessary to have, as opposed to filtering them out with an inner join? i presume that some people would at least want to keep them 🤔
i am thinking that the path forward will be
- adding match_type to the surrogate key
- auditing the whole microsoft package for any instances in which we selecting an
id
(such askeyword_id
) in the final CTE of a model, but from the right side of the join. we should change this to reference the report on the left side of the join
Hi @clay-walker the latest version of the package (v1.0.4
) should address this issue you identified within this bug report. Feel free to upgrade to the latest version and the initial issue should be addressed.
For the time being I will close this issue as the fix has been rolled out. Please feel free to reopen or create a new issue if you experience any other questions when using the package.