fivetran/dbt_zendesk

[Question] Duplicate SLA Event IDs

Closed this issue · 15 comments

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

Hey all,

First of all, thanks so much for a great package. I unfortunately couldn't use it in my environment out of the box, but I was able to get far enough in converting most of the code to Snowflake SQL to suffice! However, I am running into one unexpected issue.

In zendesk_sla_policies I am getting duplicate sla_event_ids. Here is an example:

  • Ticket's schedule is 24/5 M-F
  • Ticket Has Business Hour SLA Applied At 01:29 AM UTC
  • Next SLA Schedule Start At occurs at 04:00 AM UTC (12 AM ET)
  • Agent Replies at 1:33 AM UTC

Because the filtered_reply_times CTE only checks whether the agent_reply_at date is the same as the sla_schedule_start_at date, and the reply date here matches both rows' sla_schedule_start_at dates, both rows are retained and passed through to the zendesk_sla_policies table.

, filtered_reply_times as (
  select * 
  from lagging_time_block
  where {{ dbt.date_trunc("day", "cast(agent_reply_at as date)") }} = {{ dbt.date_trunc("day", "cast(sla_schedule_start_at as date)") }}
    or ({{ dbt.date_trunc("day", "cast(agent_reply_at as date)") }} < {{ dbt.date_trunc("day", "cast(sla_schedule_start_at as date)") }} and sum_lapsed_business_minutes_new = 0 and sla_breach_at = first_sla_breach_at)

)
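To make the duplicate concrete, here is a hypothetical Python sketch of that WHERE clause applied to the example above (the rows, timestamps, and the is_first_breach flag standing in for sla_breach_at = first_sla_breach_at are all made up for illustration):

```python
from datetime import datetime

# Hypothetical rows mirroring the example: the SLA is applied at 01:29 UTC and
# the next schedule window starts at 04:00 UTC on the same calendar date.
rows = [
    {"sla_schedule_start_at": datetime(2023, 8, 3, 1, 29),
     "sum_lapsed_business_minutes_new": 0, "is_first_breach": True},
    {"sla_schedule_start_at": datetime(2023, 8, 3, 4, 0),
     "sum_lapsed_business_minutes_new": 0, "is_first_breach": True},
]
agent_reply_at = datetime(2023, 8, 3, 1, 33)  # agent replies at 01:33 UTC

def retained(row):
    # Mirrors the WHERE clause: keep rows whose schedule start shares the
    # reply's calendar date, or pre-schedule replies with 0 lapsed minutes
    # on the first breach.
    same_day = agent_reply_at.date() == row["sla_schedule_start_at"].date()
    pre_schedule = (agent_reply_at.date() < row["sla_schedule_start_at"].date()
                    and row["sum_lapsed_business_minutes_new"] == 0
                    and row["is_first_breach"])
    return same_day or pre_schedule

kept = [r for r in rows if retained(r)]
print(len(kept))  # both rows survive the date-only comparison
```

Because the predicate compares only calendar dates, both schedule rows satisfy the first condition and neither is filtered out.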

While I know I'm not using the out-of-the-box version of the code, and thus this may be user error, I was wondering if somebody could point out how the code is supposed to solve for this type of scenario.

The only thing I can think of is that perhaps is_breached_during_schedule was supposed to be used to indicate which row to retain. It feels odd that work goes into defining that column when it is never used.

Relevant error log or model output

No response

Expected behavior

Would expect it to only return one row

dbt Project configurations

I don't have this :(

Package versions

I don't have this :(

What database are you using dbt with?

snowflake

dbt Version

I don't have this :(

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will need assistance and will schedule time during our office hours for guidance
  • No.

hey there @nschimmoller 👋

thanks for taking the time to make this issue! it sounds like this could be related to #107, in which the intercepted_periods CTE in the business hours model may be problematic.... are you also seeing the duplicates in int_zendesk__reply_time_business_hours, or just the combined model?

@fivetran-jamie thanks for getting back to me I'll try and add some more detail for troubleshooting.

It sounds like this could be related to #107

Unfortunately, I do not believe this to be the case, as my issue does not have to do with a period spanning the change of a schedule's valid_from and valid_to times.

Are you also seeing the duplicates in int_zendesk__reply_time_business_hours, or just the combined model?

Yes I am seeing multiple records for ticket_id 1609259, results of which can be seen here
reply_time_business_hours.csv

However, I believed this to be intentional, as there is no overt filtering in int_zendesk__reply_time_business_hours.sql, but there is in the int_zendesk__reply_time_combined code via the filtered_reply_times CTE.

However, the output from the int_zendesk__reply_time_combined code also generates two rows for the same ticket_id 1609259. reply_time_combined.csv

Importantly, though, none of these three rows reports the correct sla_schedule_start_at.

This ticket was created on Wednesday 2023-08-02 21:01:35.000 +0000. Our schedule during the week is 04:00:00.000 +0000 - 04:00:00.000 +0000; in other words, we are open 24/5 with our business days starting at 00:00:00.000 -0400 (US ET, America/New_York).

Given that this ticket came in during "open" business hours, the sla_schedule_start_at should start immediately (2023-08-02 21:01:35.000 +0000). However, as you can see in the results of reply_time_combined, the earliest sla_schedule_start_at applied is 2023-08-03 04:00:00.000 +0000.

Digging into the code, I believe this to be due to the filtering used in the aforementioned filtered_reply_times CTE. Relevant code:

where {{ dbt.date_trunc("day", "cast(agent_reply_at as date)") }} = {{ dbt.date_trunc("day", "cast(sla_schedule_start_at as date)") }}
  or ({{ dbt.date_trunc("day", "cast(agent_reply_at as date)") }} < {{ dbt.date_trunc("day", "cast(sla_schedule_start_at as date)") }} and sum_lapsed_business_minutes_new = 0 and sla_breach_at = first_sla_breach_at)

This can be summarized as:

  • Retain rows where the date of agent_reply_at matches the date of sla_schedule_start_at
    OR
  • Retain rows where the date of agent_reply_at is earlier than the date of sla_schedule_start_at, given that the lapsed business minutes are 0 and the breach is the first SLA breach.

However, this does not account for the scenario in which the agent responds on a day after the correct sla_schedule_start_at is applied. By modifying the filtered_reply_times CTE as seen below, I can force this row's inclusion. However, it still returns more than one row, in this case 3. filtered_reply_times_fix.csv

filtered_reply_times as (
    SELECT * 
    FROM lagging_time_block
    WHERE DATE_TRUNC('DAY', CAST(agent_reply_at AS DATE)) = DATE_TRUNC('DAY', CAST(sla_schedule_start_at AS DATE))
       OR (
           agent_reply_at >= sla_applied_at 
           AND agent_reply_at <= sla_schedule_end_at 
           AND DATE_TRUNC('DAY', CAST(agent_reply_at AS DATE)) = DATE_TRUNC('DAY', CAST(sla_schedule_end_at AS DATE))
       )
       OR (
           DATE_TRUNC('DAY', CAST(agent_reply_at AS DATE)) < DATE_TRUNC('DAY', CAST(sla_schedule_start_at AS DATE))
           AND sum_lapsed_business_minutes_new = 0
           AND sla_breach_at = first_sla_breach_at
       )
)

At this point we finally have the right row generated by int_zendesk__reply_time_combined, but there are still erroneous rows for which a way to drop them needs to be found.

You can also see in the most recent file that the sla_elapsed_time is not reporting correctly. I believe the root cause of this is the lagging_time_block CTE's calculation of sum_lapsed_business_minutes_new:

coalesce(lag(sum_lapsed_business_minutes) over (partition by sla_policy_name, metric, sla_applied_at order by sla_breach_at), 0) as sum_lapsed_business_minutes_new

The issue is that all of the rows have the same value for sla_breach_at, so the actual ordering of the rows is likely arbitrary. This could be a difference in how dbt and Snowflake handle tie-breaking in order by clauses. However, I've found that the following statement ensures the right order and the right data:

coalesce(lag(sum_lapsed_business_minutes) over (partition by sla_policy_name, metric, sla_applied_at order by sla_schedule_start_at), 0) as sum_lapsed_business_minutes_new
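To see why the tie matters, here is a hypothetical Python sketch of the lag() calculation (the two blocks and their timestamps are invented for illustration): when the order-by column is tied across rows, the lag input order is effectively arbitrary, while ordering by the distinct sla_schedule_start_at is deterministic.

```python
# Hypothetical time blocks sharing one sla_breach_at value.
blocks = [
    {"sla_schedule_start_at": "2023-08-03 04:00",
     "sla_breach_at": "2023-08-03 12:00", "sum_lapsed_business_minutes": 120},
    {"sla_schedule_start_at": "2023-08-02 21:01",
     "sla_breach_at": "2023-08-03 12:00", "sum_lapsed_business_minutes": 0},
]

def lag_new(rows, order_key):
    # Models coalesce(lag(sum_lapsed_business_minutes) over (... order by order_key), 0).
    ordered = sorted(rows, key=lambda r: r[order_key])
    return [0 if i == 0 else ordered[i - 1]["sum_lapsed_business_minutes"]
            for i in range(len(ordered))]

# Ordering by the tied sla_breach_at: Python's stable sort happens to preserve
# input order, but a SQL engine makes no such guarantee for tied keys.
print(lag_new(blocks, "sla_breach_at"))          # [0, 120] here, but order is arbitrary in SQL
# Ordering by sla_schedule_start_at breaks the tie deterministically.
print(lag_new(blocks, "sla_schedule_start_at"))  # [0, 0]
```

With the deterministic ordering, the earlier block always contributes its minutes to the lag of the later block, regardless of engine-specific tie-breaking.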

I just set up the fivetran/dbt_zendesk package and am also running into this issue

Thanks for chiming in @trunsky and sharing that you are also experiencing this issue. My team and I are currently scoping this issue out and hope to share some insights in the next day or so. Be sure to follow here for updates!

Hey all,

Wanted to reach back out with an update on a handful of changes I made in my environment that were able to resolve the issues described above:

int_zendesk__reply_time_business_hours.sql

intercepted_periods_with_breach_flag as (
  select 
    *,
    target - sum_lapsed_business_minutes as remaining_minutes,
    {{ fivetran_utils.timestamp_add(
        "minute",
        "cast(((7*24*60) * week_number) + (schedule_end_time) as " ~ dbt.type_int() ~ " )",
        "cast(" ~ dbt_date.week_start('sla_applied_at','UTC') ~ " as " ~ dbt.type_timestamp() ~ ")" ) }} as sla_schedule_end_at,
    case when (target - sum_lapsed_business_minutes) < 0 
      and 
        (lag(target - sum_lapsed_business_minutes) over
        (partition by ticket_id, metric, sla_applied_at order by week_number, schedule_start_time) >= 0 
        or 
        lag(target - sum_lapsed_business_minutes) over
        (partition by ticket_id, metric, sla_applied_at order by week_number, schedule_start_time) is null) 
        then true else false end as is_breached_during_schedule -- this flags the scheduled period on which the breach took place
  from intercepted_periods
)

Added a column for sla_schedule_end_at that will be used later for selecting rows to retain.

reply_time_business_hours_sla as (
  select
    ticket_id,
    sla_policy_name,
    metric,
    ticket_created_at,
    sla_applied_at,
    greatest(sla_applied_at,sla_schedule_start_at) as sla_schedule_start_at,
    sla_schedule_end_at,
    target,
    sum_lapsed_business_minutes,
    in_business_hours,
    sla_breach_at,
    is_breached_during_schedule
  from intercepted_periods_with_breach_flag_calculated
) 

Passing through ticket_created_at, to be used for filtering later, along with the sla_schedule_end_at mentioned previously.

int_zendesk__reply_time_combined.sql

  select 
    ticket_id,
    sla_policy_name,
    metric,
    ticket_created_at,
    sla_applied_at,
    sla_schedule_start_at,
    cast(null as timestamp) as sla_schedule_end_at,
    cast(null as {{ dbt.type_numeric() }}) as sum_lapsed_business_minutes,
    target,
    in_business_hours,
    sla_breach_at
  from reply_time_calendar_hours_sla

{% if var('using_schedules', True) %}

  union all

  select 
    ticket_id,
    sla_policy_name,
    metric,
    ticket_created_at,
    sla_applied_at,
    sla_schedule_start_at,
    sla_schedule_end_at,
    sum_lapsed_business_minutes,
    target,
    in_business_hours,
    sla_breach_at
  from reply_time_business_hours_sla
{% endif %}

)

Bringing in the ticket_created_at and sla_schedule_end_at columns for filtering.

reply_time_breached_at_with_next_reply_timestamp as (

  select 
    reply_time_breached_at.ticket_id,
    reply_time_breached_at.sla_policy_name,
    reply_time_breached_at.metric,
    reply_time_breached_at.ticket_created_at,
    reply_time_breached_at.sla_applied_at,
    reply_time_breached_at.sum_lapsed_business_minutes,
    reply_time_breached_at.target,
    reply_time_breached_at.in_business_hours,
    min(reply_time_breached_at.sla_schedule_start_at) as sla_schedule_start_at,
    min(reply_time_breached_at.sla_schedule_end_at) as sla_schedule_end_at,
    min(sla_breach_at) as sla_breach_at,
    min(reply_at) as agent_reply_at,
    min(solved_at) as next_solved_at
  from reply_time_breached_at
  left join reply_time
    on reply_time.ticket_id = reply_time_breached_at.ticket_id
    and reply_time.reply_at > reply_time_breached_at.sla_applied_at
  left join ticket_solved_times
    on reply_time_breached_at.ticket_id = ticket_solved_times.ticket_id
    and ticket_solved_times.solved_at > reply_time_breached_at.sla_applied_at
  {{ dbt_utils.group_by(n=8) }}

)

Continuing to pass through ticket_created_at and sla_schedule_end_at columns for filtering.
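The join-plus-aggregate above boils down to "take the earliest reply that came in after the SLA was applied." A hypothetical Python sketch (timestamps invented for illustration):

```python
from datetime import datetime

sla_applied_at = datetime(2023, 8, 2, 21, 1)
# Candidate replies on the ticket; the first one predates the SLA being applied.
replies = [
    datetime(2023, 8, 2, 20, 0),
    datetime(2023, 8, 3, 1, 33),
    datetime(2023, 8, 4, 9, 0),
]

# Mirrors the join condition reply_at > sla_applied_at plus min(reply_at):
qualifying = [r for r in replies if r > sla_applied_at]
agent_reply_at = min(qualifying) if qualifying else None
print(agent_reply_at)  # 2023-08-03 01:33:00, the first reply after the SLA applied
```

The left join keeps tickets with no qualifying reply (agent_reply_at stays null), which is why the later filter logic needs an explicit branch for unanswered tickets.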

 lagging_time_block as (

  select 
    *,
    min(sla_breach_at) over (partition by metric, sla_applied_at order by sla_breach_at rows unbounded preceding) as first_sla_breach_at,
    min(sla_schedule_start_at) over (partition by ticket_id, metric, sla_applied_at) as sla_measured_from,
    coalesce(lag(sum_lapsed_business_minutes) over (partition by metric, sla_applied_at order by sla_schedule_start_at), 0) as sum_lapsed_business_minutes_new
  from reply_time_breached_at_with_next_reply_timestamp

)

Updated the order by logic for sum_lapsed_business_minutes_new, as the prior logic resulted in arbitrary ordering.

filtered_reply_times as (

  select * 
  from lagging_time_block
  where 
    (
      in_business_hours = true
      and
      (
        agent_reply_at between sla_schedule_start_at and sla_schedule_end_at
        or 
        (
          agent_reply_at < sla_schedule_start_at
          and sum_lapsed_business_minutes_new = 0 
          and sla_breach_at = first_sla_breach_at
        )
      )
    )
    or
    (
      in_business_hours = false
      and
      (
        {{ dbt.date_trunc("day", "cast(agent_reply_at as date)") }} = {{ dbt.date_trunc("day", "cast(sla_schedule_start_at as date)") }}
        or 
        (
          agent_reply_at < sla_schedule_start_at
          and sum_lapsed_business_minutes_new = 0 
          and sla_breach_at = first_sla_breach_at
        )
      )
    )
    or
    (
      agent_reply_at is null
      and sla_applied_at >= ticket_created_at
    )

)

Problem:

  1. Does not include responses that occur within a single schedule but on the next day

Prior filtering only retained rows where agent_reply_at and sla_schedule_start_at occurred on the same day, or where agent_reply_at occurred prior to sla_schedule_start_at and sum_lapsed_business_minutes_new was 0. In other words, when a ticket was replied to before opening hours.

However, this misses the common scenario in which a ticket is replied to after the day of sla_schedule_start_at, but prior to sla_schedule_end_at.

  2. Does not include:
  • Responses that came in prior to sla_schedule_start but on same day
  • Tickets still awaiting a response

Changes:

  1. Solve Problem 1

Replaced the logic that retained rows where agent_reply_at and sla_schedule_start_at occurred on the same day with logic that captures rows where agent_reply_at occurs between the start and the end of the schedule.

  2. Solve Problem 2
  • Removed date_trunc statements when evaluating agent_reply_at < sla_schedule_start_at

  • Added logic for agent_reply_at is null, but ensuring that sla_applied_at >= ticket_created_at
    -- Correct row will then be selected in reply_time_breach CTE with row_number() function
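The key change for Problem 1 can be sketched as a Python predicate, with hypothetical timestamps for a 24/5 window that opens Thursday 04:00 UTC and closes Friday 04:00 UTC:

```python
from datetime import datetime

sla_schedule_start_at = datetime(2023, 8, 3, 4, 0)  # window opens Thu 04:00 UTC
sla_schedule_end_at = datetime(2023, 8, 4, 4, 0)    # window closes Fri 04:00 UTC
agent_reply_at = datetime(2023, 8, 4, 1, 0)         # reply on the next calendar day,
                                                    # but still inside the window

# Old check: compare calendar dates only. The next-day reply is dropped.
old_keep = agent_reply_at.date() == sla_schedule_start_at.date()

# New check: keep any reply that lands between schedule start and end.
new_keep = sla_schedule_start_at <= agent_reply_at <= sla_schedule_end_at

print(old_keep, new_keep)  # False True
```

The date-only comparison drops a reply that lands inside the schedule window but on the following calendar day, which is exactly the row the revised filter now retains.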

reply_time_breached_at_remove_old_sla as (

  select 
    *,
    lead(sla_applied_at) over (partition by ticket_id, metric, in_business_hours order by sla_schedule_start_at) as updated_sla_policy_starts_at,
    row_number() over (partition by ticket_id, sla_policy_name, metric, in_business_hours order by sla_schedule_start_at) as row_num,
    case when 
      lead(sla_applied_at) over (partition by ticket_id, metric, in_business_hours order by sla_schedule_start_at)
      < sla_breach_at then true else false end as is_stale_sla_policy,
    case when (sla_breach_at < agent_reply_at and sla_breach_at < next_solved_at)
                or (sla_breach_at < agent_reply_at and next_solved_at is null)
                or (agent_reply_at is null and sla_breach_at < next_solved_at)
                or (agent_reply_at is null and next_solved_at is null)
      then true
      else false
        end as is_sla_breached
  from filtered_reply_times
  where sla_schedule_end_at > sla_schedule_start_at

)

Problem:

  1. Similar to the problem in the lagging_time_block CTE above, the code was previously ordering by sla_applied_at, making the resulting ordering arbitrary. Ordering by sla_schedule_start_at ensures that is_stale_sla_policy is applied to the correct rows.
  2. There was no way to deduplicate rows at the end; by using a row_number window function here we can ensure we retain only one row per ticket_id and metric combination.
  3. Some resulting rows had values for sla_schedule_end_at less than or equal to their values for sla_schedule_start_at, which should never be possible.

Changes:

  1. Updated the window functions' order by logic to use sla_schedule_start_at (solves Problem 1)
  2. Added filter logic to remove rows whose sla_schedule_end_at is not after their sla_schedule_start_at (solves Problem 3)
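The row_number deduplication described above can be sketched in Python (the ticket_id and timestamps are hypothetical): within each partition, only the row with the earliest sla_schedule_start_at gets row_num = 1 and survives the later filter.

```python
from itertools import groupby

# Hypothetical candidate rows left after filtering; both belong to the same
# ticket/metric partition.
rows = [
    {"ticket_id": 1609259, "metric": "first_reply_time",
     "sla_schedule_start_at": "2023-08-03 04:00"},
    {"ticket_id": 1609259, "metric": "first_reply_time",
     "sla_schedule_start_at": "2023-08-02 21:01"},
]

partition = lambda r: (r["ticket_id"], r["metric"])
# Models row_number() over (partition ... order by sla_schedule_start_at) = 1:
# keep the earliest-starting row in each partition.
deduped = [min(group, key=lambda r: r["sla_schedule_start_at"])
           for _, group in groupby(sorted(rows, key=partition), key=partition)]
print(len(deduped), deduped[0]["sla_schedule_start_at"])
```

ISO-formatted timestamp strings sort chronologically, so the min() here picks the 2023-08-02 21:01 row, matching the order by in the SQL window function.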

Thoughts:

Truthfully, I'm not entirely sure what purpose updated_sla_policy_starts_at or is_stale_sla_policy serves. I'm assuming that, since the logic used here matches logic I identified as problematic elsewhere, the same changes were needed.

reply_time_breach as (

  select 
    *,
    case when {{ dbt.datediff("sla_schedule_start_at", "agent_reply_at", 'minute') }} < 0 
      then 0 
      else sum_lapsed_business_minutes_new + {{ dbt.datediff("sla_schedule_start_at", "agent_reply_at", 'minute') }} 
    end as sla_elapsed_time
  from reply_time_breached_at_remove_old_sla
  where row_num = 1

)

Use row_num to filter to only one row per ticket/metric combination.

@nschimmoller thank you so much for the detailed write up and for sharing the code you were able to apply to address the duplicates. I am curious whether this fixed all the duplicates without introducing any other issues. I attempted to make the same updates (you can see them in the bugfix/duplciate-sla-policies branch I created); however, I found this reduced the record count of calendar metric tickets in my zendesk__sla_policies model from 100k+ to just around 1k. It seems that the logic introduced in v0.11.0, which you are enhancing in your above code suggestions, is impacting the non business hour tickets.

Nevertheless, I believe you are onto something with calling out that the additional logic within the int_zendesk__reply_time_combined model (particularly these ctes) is not accounting for the majority of common behaviors when it comes to SLAs. Our team is still exploring this and using your suggestions to help identify a possible code update to correct the duplicates in the business hour SLA and address the bug in the non business hour sla tickets.

In the meantime, I am curious if you experience these same issues in the previous minor version of the package v0.10.2? This version does not account for schedule holidays, but I am curious if the duplicates you see are resolved? You can leverage this version by downgrading the package in your packages.yml. If you do not see any errors with this version then that will really help us identify the problem code and ensure we are able to find a viable solution. @nschimmoller I know you said you were unable to leverage the package out of the box, but you can see the code differences between the versions here.

packages:
  - package: fivetran/zendesk
    version: [">=0.10.0", "<0.11.0"]

Further, would either @trunsky or @nschimmoller be interested in meeting with a member of my team to troubleshoot in more detail so we may understand the issue further and collaborate on a viable solution? Unfortunately, I will be out on PTO for the next few weeks, but @fivetran-jamie from my team will be able to continue scoping this out further.

Thanks!

Confirming that downgrading the package resolved the test error for me. I'm probably the wrong person to troubleshoot this data, as pulling in Zendesk SLAs was a request from someone on our team, and I'd be lying if I said I fully understand the Zendesk data at this point 😅.

@fivetran-joemarkiewicz can we setup some time to walk through this early next week?

@nschimmoller yes definitely, what days/times work for y'all?

@fivetran-jamie I'm free any time after noon ET next Monday (9/11). Does that work for you at all?

@nschimmoller works perfectly for me, I'm free all day Monday! Thanks

If you'd like to send out the invite, my email is jamie.rodriguez@fivetran.com

Hi @nschimmoller you can include me as well! renee.li@fivetran.com

Hi @nschimmoller again, we appreciate you opening this up and working with us through the fix! This has been included in our latest release of the package and as such we'll be closing this out.