cal-itp/data-infra

Ingest and transform old style Benefits enrollments into Metabase

Closed this issue ยท 7 comments

User story / feature request

As a Metabase dashboard creator or consumer, I want to be able to see the full history of Benefits enrollments in a single format together in a single dataset/chart to avoid having to manually add to the numbers currently available.

Acceptance Criteria

  • Data from Amplitude is back-filled into the GCS data destination for the date range 2021-12-08 through 2022-08-11 Pacific

DMV events

This Amplitude chart shows events that match the following criteria that should be transformed and ingested into Metabase in the new format: https://app.amplitude.com/analytics/compiler/chart/aztqyfek

  • For all events before 2022-08-12 Pacific, of type viewed page with path = /enrollment/success, region = California or region = (none) and all values of city except Los Angeles, create a new event of type returned enrollment with status = success
  • Each event should be created in the mart_benefits.fct_benefits_events table according to the existing structure, and re-using as much relevant event information from the original viewed page event as possible (e.g. dates, times)
  • For each event created, assign the transit_agency properties the value of Monterey-Salinas Transit
  • For each event created, assign the eligibility_types properties the value of senior
  • For each event created, assign the eligibility_verifier properties the value of DMV
  • For each event created, assign the payment_group property the value of 5170d37b-43d5-4049-899c-b4d850e14990
  • There should be 18 new events created matching the city breakdown below:
City Total events
Salinas 8
San Jose 3
Monterey 2
(none) 2
Yuba City 1
Pacific Grove 1
San Francisco 1
Total 18

Login.gov events

This Amplitude chart shows events that match the following criteria that should be transformed and ingested into Metabase in the new format: https://app.amplitude.com/analytics/compiler/chart/l6lx6x7i

  • For all events including and after 2022-08-12 Pacific, of type viewed page with path = /enrollment/success, region = California or region = (none) and all values of city except Los Angeles, create a new event of type returned enrollment with status = success
  • Each event should be created in the mart_benefits.fct_benefits_events table according to the existing structure, and re-using as much relevant event information from the original viewed page event as possible (e.g. dates, times)
  • For each event created, assign the transit_agency properties the value of Monterey-Salinas Transit
  • For each event created, assign the eligibility_types properties the value of senior
  • For each event created, assign the eligibility_verifier properties the value of (MST) CDT claims via Login.gov
  • For each event created, assign the payment_group property the value of 5170d37b-43d5-4049-899c-b4d850e14990
  • There should be 4 new events created matching the city breakdown below:
City Total events
Monterey 4
Total 4

Notes

This came out of a discussion with @o-ram and @indexing as we start to roll out more public-facing Benefits metrics in e.g. the new Benefits newsletter, where numbers are based on the newer format for enrollment events, it is time-consuming and error-prone (and maybe not even possible from e.g. a Metabase chart perspective) to have to remember to add back these "old style" enrollments into the metrics.

The Benefits Timeline document helps explain some of the evolution in the Benefits app and reasons why there are differences in events captured during the "DMV days" when Benefits was hosted by the DMV and used driver's license information to verify eligibility for Older Adults.

TLDR; in cal-itp/benefits/pull/873 enrollments were defined as a specific event type rather than interpreted from a generic viewed page event type. This change occurred after the Benefits relaunch with CDT and Login.gov, so there are old enrollments from both the DMV/pre-Login.gov days, and after.

Implementation

I think this can be done in the main DBT model/script for fct_benefits_events.

We may need something like a separate WITH clause to create a new subquery/Common Table Expression (CTE) to select out the old events and transform them according to the criteria above, and then the final SELECT would need to be modified to JOIN the 2 CTEs together into the resultant table.

Query to get started

This query matches the conditions above for which enrollment events should be converted:

SELECT * FROM `cal-itp-data-infra.staging.stg_amplitude__benefits_events`
WHERE client_event_time >= '2021-12-08T08:00:00Z' 
  and client_event_time < '2022-08-29T07:00:00Z'
  and (region = 'California' or region is null)
  and city <> 'Los Angeles'
  and event_type = 'viewed page'
  and JSON_VALUE(event_properties, '$.path') = '/enrollment/success'

Slack thread to ensure this approach makes sense.

Backfill from Amplitude into the GCS data destination has started:

Image

The backfill finished today, exporting an additional 342K events.

Image

Confirming that data exists in GCP for this back-fill.

Bucket

E.g. for the beginning of the date range:

image

And the end:

image

Staging table

Running this query in the console:

SELECT city, count(client_event_time) as enrollments
FROM `cal-itp-data-infra.staging.stg_amplitude__benefits_events`
WHERE client_event_time >= '2021-12-08T08:00:00Z' 
  and client_event_time < '2022-08-29T07:00:00Z'
  and (region = 'California' or region is null)
  and (city <> 'Los Angeles' or city is null)
  and event_type = 'viewed page'
  and JSON_VALUE(event_properties, '$.path') = '/enrollment/success'
GROUP BY city
ORDER BY enrollments desc, city

combining the date criteria above

Produces the expected results:

city enrollments
Salinas 8
Monterey 6
San Jose 3
null 2
Pacific Grove 1
San Francisco 1
Yuba City 1
Total 22

Query I'm starting with to process the two groups (DMV and Login.gov) separately before combining:

WITH 
fct_old_enrollments AS (
  SELECT *
  FROM `cal-itp-data-infra.staging.stg_amplitude__benefits_events`
  WHERE client_event_time >= '2021-12-08T08:00:00Z' 
    and client_event_time < '2022-08-29T07:00:00Z'
    and (region = 'California' or region is null)
    and (city <> 'Los Angeles' or city is null)
    and event_type = 'viewed page'
    and JSON_VALUE(event_properties, '$.path') = '/enrollment/success'
  ORDER BY client_event_time desc
),
fct_dmv_events AS (
  SELECT 
    app,
    device_id,
    user_id,
    client_event_time,
    event_id,
    session_id,
    "returned enrollment" as event_type,
    version_name,
    os_name,
    os_version,
    device_family,
    device_type,
    country,
    language,
    library,
    city,
    region,
    event_time,
    client_upload_time,
    server_upload_time,
    server_received_time,
    amplitude_id,
    start_version,
    uuid,
    processed_time,
    "senior" as event_properties_eligibility_types,
    "DMV" as event_properties_eligibility_verifier,
    "5170d37b-43d5-4049-899c-b4d850e14990" as event_properties_payment_group,
    "success" as event_properties_status,
    "Monterey-Salinas Transit" as event_properties_transit_agency,
    "senior" as user_properties_eligibility_types,
    "DMV" as user_properties_eligibility_verifier,
    "Monterey-Salinas Transit" as user_properties_transit_agency
  FROM fct_old_enrollments
  WHERE client_event_time < '2022-08-12T07:00:00Z'
),
fct_login_events AS (
  SELECT
    app,
    device_id,
    user_id,
    client_event_time,
    event_id,
    session_id,
    "returned enrollment" as event_type,
    version_name,
    os_name,
    os_version,
    device_family,
    device_type,
    country,
    language,
    library,
    city,
    region,
    event_time,
    client_upload_time,
    server_upload_time,
    server_received_time,
    amplitude_id,
    start_version,
    uuid,
    processed_time,
    version_name,
    platform,
    os_name,
    os_version,  
    "senior" as event_properties_eligibility_types,
    "OAuth claims via Login.gov" as event_properties_eligibility_verifier,
    "5170d37b-43d5-4049-899c-b4d850e14990" as event_properties_payment_group,
    "success" as event_properties_status,
    "Monterey-Salinas Transit" as event_properties_transit_agency,
    "senior" as user_properties_eligibility_types,
    "OAuth claims via Login.gov" as user_properties_eligibility_verifier,
    "Monterey-Salinas Transit" as user_properties_transit_agency
  FROM fct_old_enrollments
  WHERE client_event_time >= '2022-08-12T07:00:00Z'
)

SELECT * FROM fct_dmv_events
UNION DISTINCT
SELECT * FROM fct_login_events

This is producing the expected 22 records in the new enrollment format

Keeping this In review while we wait for the warehouse refresh and can confirm the new data appears in Metabase.

@thekaveman Legacy enrollments for older adults are now appearing in Metabase. Total enrollment count is now accurate as well. Woot woot! As far as I know, we didn't modify the query; they simply appeared this morning. cc// @o-ram

Confirming the 18 DMV-era events are represented:

image

As well as the 4 Login.gov / pre-analytics-migration events:

image

For a total of 22 new (old) enrollments!