fivetran/dbt_fivetran_log

BUG - The stg_fivetran_log__log model does not handle duplicate rows in the source logs table

Closed this issue · 11 comments

Are you a current Fivetran customer?
Jacob Mulligan, head of analytics, Firefly Health

Describe the bug
We are currently experiencing an issue where the stg_fivetran_log__log and fivetran_log models are failing with the following test errors:

09:40:01  Failure in test dbt_utils_unique_combination_of_columns_fivetran_log__schema_changelog_connector_id__destination_id__created_at (models/fivetran_log.yml)
09:40:01    Got 7 results, configured to fail if != 0
09:40:01  
09:40:01    compiled SQL at target/compiled/fivetran_log/models/fivetran_log.yml/dbt_utils_unique_combination_o_a484aa854783c3b3614d0db8c28551e9.sql
09:40:01  
09:40:01  Failure in test dbt_utils_unique_combination_of_columns_stg_fivetran_log__log_log_id__created_at (models/staging/stg_fivetran_log.yml)
09:40:01    Got 1423 results, configured to fail if != 0
09:40:01  
09:40:01    compiled SQL at target/compiled/fivetran_log/models/staging/stg_fivetran_log.yml/dbt_utils_unique_combination_o_124db0b56a6fac04fff9f44f342dcecf.sql

The root issue is that there actually is row duplication in the logs table created by the Fivetran Logs connector.

I first ran the SQL for the above dbt_utils_unique_combination_of_columns_stg_fivetran_log__log_log_id__created_at test failure which led me to a few specific duplicate rows:

    select
        log_id,
        created_at,
        count(*)
    from analytics.dbt_jacob_mulligan_stg_fivetran_log.stg_fivetran_log__log
    group by log_id, created_at
    having count(*) > 1;

I then ran this SQL to confirm that there are duplicate rows across the log_id and created_at columns in the raw log data that Fivetran has written into Snowflake:

    select *
    from raw.fivetran_logs.log
    where 
        id = 'logic_exporting'
        and time_stamp = '2022-02-09 18:52:45.842 +0000'

When I run this SQL I get the following output. Notice that the only difference is in the SEQUENCE_NUMBER column: 16,112 vs. 16,117.

| TIME_STAMP | CONNECTOR_ID | TRANSFORMATION_ID | EVENT | MESSAGE_EVENT | MESSAGE_DATA | _FIVETRAN_SYNCED | ID | SYNC_ID | SEQUENCE_NUMBER | PROCESS_ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2022-02-09 18:52:45.842 +0000 | logic_exporting |  | INFO | copy_rows | {"schema":"fivetran_wick_crumpet_staging","name":"lucian_public_tasks-staging-4f8f229f-5b71-465a-8c71-fe4a47e25695"} | 2022-02-09 19:16:13.412 +0000 | logic_exporting | b5e47b18-dc88-4246-9f33-64512205e195 | 16,112 | 4e4d37a2-b9b9-43be-b9ea-45715f302903 |
| 2022-02-09 18:52:45.842 +0000 | logic_exporting |  | INFO | copy_rows | {"schema":"fivetran_wick_crumpet_staging","name":"lucian_public_tasks-staging-4f8f229f-5b71-465a-8c71-fe4a47e25695"} | 2022-02-09 19:16:13.412 +0000 | logic_exporting | b5e47b18-dc88-4246-9f33-64512205e195 | 16,117 | 4e4d37a2-b9b9-43be-b9ea-45715f302903 |

Should the stg_fivetran_log__log model select the most recent row based on SEQUENCE_NUMBER?

Expected behavior
I would expect the stg_fivetran_log__log model (here) to handle duplicate rows. The model could either arbitrarily select one of the two rows, or (see above) grab the row with the larger SEQUENCE_NUMBER, on the assumption that a later sequence number means the row was written later and is presumably the "right" row to reference.
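
In case it's useful, here is a minimal sketch of that dedup written as a plain subquery, since `qualify` isn't available on every warehouse this package supports (the table reference mirrors the raw table I queried above; the alias and column names like nth_row are just illustrative):

    -- Sketch: keep only the row with the largest sequence_number
    -- for each (id, time_stamp) pair in the raw log table.
    select *
    from (
        select
            raw_log.*,
            row_number() over (
                partition by id, time_stamp
                order by sequence_number desc
            ) as nth_row
        from raw.fivetran_logs.log as raw_log
    ) as ordered_rows
    where nth_row = 1;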

Project variables configuration

vars:
  fivetran_log:
    # Link to the package that relies on these vars:
    # https://hub.getdbt.com/fivetran/fivetran_log/latest/
    fivetran_log_database: raw
    fivetran_log_schema: fivetran_logs

models:
  fivetran_log:
    # Explicitly disabled because this model is extremely slow to build;
    # it has taken as long as 14 minutes to build this one table. We don't
    # use this model, so there's no reason to slow down build time for it.
    fivetran_log__audit_table:
      +enabled: false

Package Version

packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0

  - package: dbt-labs/segment
    version: 0.7.0

  - package: fivetran/fivetran_log
    version: 0.5.0

  - package: fivetran/salesforce_source
    version: 0.4.1

  - package: fivetran/salesforce
    version: 0.5.1

  - package: fivetran/jira
    version: 0.7.0

  - package: calogica/dbt_date
    version: 0.5.2

  - package: brooklyn-data/dbt_artifacts
    version: 0.6.0 #updated to 0.7.0, got a failed dbt test. checking in dbt slack about the error but reverting the change in the meantime back to 0.6.0

  - package: entechlog/dbt_snow_mask
    version: 0.1.6

  - package: fivetran/github
    version: 0.4.0

  - package: fivetran/github_source
    version: 0.4.1
  
  - package: tnightengale/dbt_meta_testing
    version: 0.3.5

Warehouse

  • [ ] BigQuery
  • [ ] Redshift
  • [x] Snowflake
  • [ ] Postgres
  • [ ] Databricks
  • [ ] Other (provide details below)

Screenshots
The test failures in dbt Cloud:
[image]

Screenshot from Snowflake showing that the rows are identical except the sequence_number column
[image]

Please indicate the level of urgency
This is currently causing all of our production runs of dbt to "fail" because these tests don't pass. We also haven't been able to figure out how to override the test severity to warn.
Low to moderate level of severity.
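
(A possible stopgap we haven't verified yet: dbt supports overriding the severity of a package's tests from dbt_project.yml, something along these lines.)

# dbt_project.yml (sketch) -- downgrade the package's tests from error to warn
tests:
  fivetran_log:
    +severity: warn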

Are you interested in contributing to this package?

  • Yes, I can do this and open a PR for your review.
  • Possibly, but I'm not quite sure how to do this. I'd be happy to do a live coding session with someone to get this fixed.
  • No, I'd prefer if someone else fixed this. I don't have the time and/or don't know what the root cause of the problem is.

This might not be the ideal solution; I'm not intimately familiar with the log data provided by Fivetran, so I can't speak to exactly why the rows are duplicated in the raw table in the first place. That said, if the `stg_fivetran_log__log` model is updated to include this qualify statement then the failure goes away for me:

with log as (

    select * from {{ var('log') }}

),

fields as (

    select
        id as log_id,
        time_stamp as created_at,
        connector_id, -- Note: the connector_id column used to erroneously equal the connector_name, NOT its id.
        case
            when transformation_id is not null and event is null then 'TRANSFORMATION'
            else event
        end as event_type,
        message_data,
        case
            when transformation_id is not null and message_data like '%has succeeded%' then 'transformation run success'
            when transformation_id is not null and message_data like '%has failed%' then 'transformation run failed'
            else message_event
        end as event_subtype,
        transformation_id

    from log

    -- deduplicate: keep the row with the largest sequence_number for each (log_id, created_at) pair
    qualify row_number() over (partition by id, created_at order by sequence_number desc) = 1
)

select * from fields

Hi @jcbmllgn, thanks so much for opening this issue and raising it to our attention. My original understanding was that these records should be unique, so I wonder what generates this SEQUENCE_NUMBER and why some records would have multiples. Let me follow up internally to understand this better.

However, if it does turn out that these multiple sequence numbers are appropriate, then the proposed change you have above makes sense to me! I will hopefully follow up shortly with an update.

Just adding that we have seen the exact same behaviour on our side and can confirm this is happening here as well.
[image]

Thanks @NoToWarAlways and @jcbmllgn. I am now seeing the same behavior in my data as well.

The search continues 🔍

Hi @jcbmllgn and @NoToWarAlways, just wanted to provide an update: I have opened an internal ticket for our engineering team to take a look at the connector and this change, so we can get a comprehensive understanding of this behavior. I will let you all know once we have more details!

I just want to add that we are seeing this behaviour as well. Glad it is being looked into.

Hey All,

I have been able to work with our engineering team and found that this was a change on the connector end to address an edge case where two events are processed at the same time. However, once these duplicates showed up, it became apparent that the change should be reverted.

The engineering team has now reverted the change on the connector; however, the duplicate records will remain even if you perform a re-sync. We also discovered that these duplicates only occurred on February 9th, and since the revert of the change there shouldn't be any other days with duplicates. Therefore, the best plan forward is to delete the duplicate records from that day.
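
If you'd like to confirm this against your own destination, a quick check along these lines should only return that date (this uses the source table location from the original report; adjust the database and schema for your setup):

-- Count duplicated (id, time_stamp) keys in the raw log table, by day.
select
    to_date(time_stamp) as dup_date,
    count(*)            as duplicate_keys
from (
    select id, time_stamp
    from raw.fivetran_logs.log
    group by id, time_stamp
    having count(*) > 1
)
group by dup_date
order by dup_date;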

Once those records are deleted, the dbt tests should no longer fail. Further, since the change has been reverted within the connector, the error should not show up again.

Please let me know if you have any questions!

We also ran into this issue. We did as @fivetran-joemarkiewicz recommended and removed the records that were causing the issue. It's working fine now.

rodik commented

Hi,
here is the delete statement we used on our Snowflake database to deduplicate:

begin work;

delete
-- select f.*
from
    "DATABASE"."FIVETRAN_LOG"."LOG" as f
using 
-- join
(
    select 
        id, 
        time_stamp, 
        process_id,
        count(*)                as cnt,
        max(sequence_number)    as second_seq_num
    from 
        "DATABASE"."FIVETRAN_LOG"."LOG"
    group by
        id, time_stamp, process_id
    having
        count(*) > 1
) as d 
where
    f.id = d.id and 
    f.time_stamp = d.time_stamp and
    f.process_id = d.process_id and
    f.sequence_number = d.second_seq_num;

commit work;

To do a test run before deleting, switch `delete`/`using` to `select f.*`/`join` (the commented lines above), as spelled out below!
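
In full, that dry run looks like this; it uses the same predicate as the delete, so it returns exactly the rows that would be removed:

select f.*
from
    "DATABASE"."FIVETRAN_LOG"."LOG" as f
join
(
    select
        id,
        time_stamp,
        process_id,
        max(sequence_number) as second_seq_num
    from
        "DATABASE"."FIVETRAN_LOG"."LOG"
    group by
        id, time_stamp, process_id
    having
        count(*) > 1
) as d
    on  f.id = d.id
    and f.time_stamp = d.time_stamp
    and f.process_id = d.process_id
    and f.sequence_number = d.second_seq_num;

One caveat: this only removes the single row with the largest sequence_number per key, so if a key ever had more than two copies, the delete would need to be run again until no duplicates remain.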

I got word from the engineering team that the issue that originally caused this duplication has been patched. Further, the suggestion above to remove the existing duplicate records works as intended.

Closing this issue, feel free to re-open it if this continues to persist. Thanks everyone!