BUG - The stg_fivetran_log__log model does not handle duplicate rows in the source logs table
Closed this issue · 11 comments
Are you a current Fivetran customer?
Jacob Mulligan, head of analytics, Firefly Health
Describe the bug
We are currently experiencing an issue where the `stg_fivetran_log__log` and `fivetran_log` models are failing with the following test errors:
09:40:01 Failure in test dbt_utils_unique_combination_of_columns_fivetran_log__schema_changelog_connector_id__destination_id__created_at (models/fivetran_log.yml)
09:40:01 Got 7 results, configured to fail if != 0
09:40:01
09:40:01 compiled SQL at target/compiled/fivetran_log/models/fivetran_log.yml/dbt_utils_unique_combination_o_a484aa854783c3b3614d0db8c28551e9.sql
09:40:01
09:40:01 Failure in test dbt_utils_unique_combination_of_columns_stg_fivetran_log__log_log_id__created_at (models/staging/stg_fivetran_log.yml)
09:40:01 Got 1423 results, configured to fail if != 0
09:40:01
09:40:01 compiled SQL at target/compiled/fivetran_log/models/staging/stg_fivetran_log.yml/dbt_utils_unique_combination_o_124db0b56a6fac04fff9f44f342dcecf.sql
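For context, the dbt_utils unique_combination_of_columns test compiles to roughly the shape below (an approximation, not the exact compiled SQL at the paths above), so every repeated (log_id, created_at) pair surfaces as one failing row:
-- approximate shape of the failing uniqueness test (illustrative only)
with validation_errors as (
    select
        log_id,
        created_at
    from analytics.dbt_jacob_mulligan_stg_fivetran_log.stg_fivetran_log__log
    group by log_id, created_at
    having count(*) > 1
)
select *
from validation_errors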
The root issue is that there actually is row duplication in the logs
table created by the Fivetran Logs connector.
I first ran the SQL for the above `dbt_utils_unique_combination_of_columns_stg_fivetran_log__log_log_id__created_at` test failure, which led me to a few specific duplicate rows:
select
log_id,
created_at,
count(*)
from analytics.dbt_jacob_mulligan_stg_fivetran_log.stg_fivetran_log__log
group by log_id, created_at
having count(*) > 1
;
I then ran this SQL to confirm that there are duplicate rows across the `log_id` and `created_at` columns in the raw log data that Fivetran has written into Snowflake:
select *
from raw.fivetran_logs.log
where
id = 'logic_exporting'
and time_stamp = '2022-02-09 18:52:45.842 +0000'
When I run this SQL I get the following output. Notice that the only difference is in the `SEQUENCE_NUMBER` column: 16,112 vs. 16,117.
| TIME_STAMP | CONNECTOR_ID | TRANSFORMATION_ID | EVENT | MESSAGE_EVENT | MESSAGE_DATA | _FIVETRAN_SYNCED | ID | SYNC_ID | SEQUENCE_NUMBER | PROCESS_ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2022-02-09 18:52:45.842 +0000 | logic_exporting | | INFO | copy_rows | {"schema":"fivetran_wick_crumpet_staging","name":"lucian_public_tasks-staging-4f8f229f-5b71-465a-8c71-fe4a47e25695"} | 2022-02-09 19:16:13.412 +0000 | logic_exporting | b5e47b18-dc88-4246-9f33-64512205e195 | 16,112 | 4e4d37a2-b9b9-43be-b9ea-45715f302903 |
| 2022-02-09 18:52:45.842 +0000 | logic_exporting | | INFO | copy_rows | {"schema":"fivetran_wick_crumpet_staging","name":"lucian_public_tasks-staging-4f8f229f-5b71-465a-8c71-fe4a47e25695"} | 2022-02-09 19:16:13.412 +0000 | logic_exporting | b5e47b18-dc88-4246-9f33-64512205e195 | 16,117 | 4e4d37a2-b9b9-43be-b9ea-45715f302903 |
Should the `stg_fivetran_log__log` model select the most recent row based on `SEQUENCE_NUMBER`?
Expected behavior
I would expect the `stg_fivetran_log__log` model (here) to handle duplicate rows. The model could either "just" select one of the two rows at random or (see above) grab the row with the larger `SEQUENCE_NUMBER`, assuming that a later sequence number means the row came later and is presumably the "right" row to reference.
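For illustration, here is a minimal sketch of that second option as a standalone query against the raw table (not the package's actual model; qualify is Snowflake-specific):
-- keep only the duplicate with the highest sequence_number for each (id, time_stamp) pair
select *
from raw.fivetran_logs.log
qualify row_number() over (
    partition by id, time_stamp
    order by sequence_number desc
) = 1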
Project variables configuration
vars:
  fivetran_log:
    # Link to the package that relies on these vars:
    # https://hub.getdbt.com/fivetran/fivetran_log/latest/
    fivetran_log_database: raw
    fivetran_log_schema: fivetran_logs

models:
  fivetran_log:
    # Explicitly excluding because the runtime of this model is extremely slow,
    # it's taken as long as 14 minutes to build this one table. We don't use
    # this model, therefore there's no reason to slow down build time for it
    fivetran_log__audit_table:
      +enabled: false
Package Version
packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0
  - package: dbt-labs/segment
    version: 0.7.0
  - package: fivetran/fivetran_log
    version: 0.5.0
  - package: fivetran/salesforce_source
    version: 0.4.1
  - package: fivetran/salesforce
    version: 0.5.1
  - package: fivetran/jira
    version: 0.7.0
  - package: calogica/dbt_date
    version: 0.5.2
  - package: brooklyn-data/dbt_artifacts
    version: 0.6.0 # updated to 0.7.0, got a failed dbt test. checking in dbt slack about the error but reverting the change in the meantime back to 0.6.0
  - package: entechlog/dbt_snow_mask
    version: 0.1.6
  - package: fivetran/github
    version: 0.4.0
  - package: fivetran/github_source
    version: 0.4.1
  - package: tnightengale/dbt_meta_testing
    version: 0.3.5
Warehouse
- [ ] BigQuery
- [ ] Redshift
- [x] Snowflake
- [ ] Postgres
- [ ] Databricks
- [ ] Other (provide details below)
Screenshots
The test failures in dbt Cloud:
Screenshot from Snowflake showing that the rows are identical except the `sequence_number` column:
Please indicate the level of urgency
This is currently causing all of our production runs of dbt to "fail" because these tests don't pass. We also haven't been able to figure out how to override the test severity to `warn`.
Low to moderate level of severity.
Are you interested in contributing to this package?
- Yes, I can do this and open a PR for your review.
- Possibly, but I'm not quite sure how to do this. I'd be happy to do a live coding session with someone to get this fixed.
- No, I'd prefer if someone else fixed this. I don't have the time and/or don't know what the root cause of the problem is.
This might not be the ideal solution; I'm not intimately familiar with the log data provided by Fivetran, so I can't speak perfectly to why the rows are duplicated in the raw table in the first place. That said, if the `` file is updated to include this qualify statement, the failure goes away for me:
with log as (

    select * from {{ var('log') }}

),

fields as (

    select
        id as log_id,
        time_stamp as created_at,
        connector_id, -- Note: the connector_id column used to erroneously equal the connector_name, NOT its id.
        case when transformation_id is not null and event is null then 'TRANSFORMATION'
            else event end as event_type,
        message_data,
        case
            when transformation_id is not null and message_data like '%has succeeded%' then 'transformation run success'
            when transformation_id is not null and message_data like '%has failed%' then 'transformation run failed'
            else message_event end as event_subtype,
        transformation_id
    from log
    -- keep a single row per (id, time_stamp); the duplicate with the highest sequence_number wins
    qualify row_number() over (partition by id, time_stamp order by sequence_number desc) = 1

)

select * from fields
Hi @jcbmllgn thanks so much for opening this issue and raising it to our attention. My original understanding is that these should be unique, and I wonder what generates this `SEQUENCE_NUMBER` and why some records would have multiples. Let me follow up internally to understand this better.
However, if it does come out that these multiple sequences are appropriate then the proposed change you have above makes sense to me! I will hopefully follow up shortly with an update.
Thanks @NoToWarAlways and @jcbmllgn. I am now seeing the same behavior in my data as well.
The search continues.
Hi @jcbmllgn and @NoToWarAlways just wanted to provide an update that I have opened an internal ticket for our engineering teams to take a look at the connector and this change to provide a comprehensive understanding of this behavior. I will let you all know once we get more details!
I would also just like to add that we are seeing this behaviour as well. Glad it is being looked into.
Hey All,
I have been able to work with our engineering team and found that this was a change on the connector end to address an edge case where two events are processed at the same time. However, once these duplicates showed up, it became apparent that this change should be reverted.
I have been made aware that the engineering team has reverted the changes on the connector. However, these records will still remain even if you perform a re-sync. With that, we discovered that these duplicates only occurred on February 9th, and since the change was reverted there shouldn't be any other days with these duplicates. Therefore, the best plan forward would be to delete the duplicate records from that day.
Once those records are deleted, the dbt tests should no longer fail. Further, since this change has been reverted within the connector, the error should not show up again.
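A quick way to confirm the duplicates are confined to that day (a hedged sketch against the raw table referenced earlier in this issue):
-- count duplicate (id, time_stamp) pairs per day; only 2022-02-09 should appear
select
    to_date(time_stamp) as log_date,
    count(*) as duplicate_pairs
from (
    select id, time_stamp
    from raw.fivetran_logs.log
    group by id, time_stamp
    having count(*) > 1
) as dupes
group by log_date
order by log_date;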
Please let me know if you have any questions!
We also ran into this issue. We did as @fivetran-joemarkiewicz recommended and removed the records that were causing the issue. It's working fine now.
Hi,
here is the delete statement we used on our Snowflake database to deduplicate:
begin work;
delete
-- select f.*
from
"DATABASE"."FIVETRAN_LOG"."LOG" as f
using
-- join
(
select
id,
time_stamp,
process_id,
count(*) as cnt,
max(sequence_number) as second_seq_num
from
"DATABASE"."FIVETRAN_LOG"."LOG"
group by
id, time_stamp, process_id
having
count(*) > 1
) as d
where
f.id = d.id and
f.time_stamp = d.time_stamp and
f.process_id = d.process_id and
f.sequence_number = d.second_seq_num;
commit work;
Switch `delete`/`using` to `select`/`join` and do a test run before deleting!
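As a sanity check after the delete (same placeholder database and schema names as the statement above), re-running the duplicate count should return zero rows before dbt is kicked off again:
-- verify no duplicate (id, time_stamp, process_id) combinations remain
select
    id,
    time_stamp,
    process_id,
    count(*) as cnt
from "DATABASE"."FIVETRAN_LOG"."LOG"
group by id, time_stamp, process_id
having count(*) > 1;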
I got word from the engineering team that the issue that caused this duplication originally has been patched. Further, with the suggestion above, the solution to remove the duplicate records works as intended.
Closing this issue; feel free to re-open it if this continues to persist. Thanks everyone!