August "combined" tables contain duplicates
rviscomi opened this issue · 5 comments
Looks like we need to dedupe all of the "combined" pipeline tables: lighthouse, pages, etc.
Also, how did this happen? Looking at the table info it seems like the tables were created on the 14th and updated on the 17th. Maybe we skipped a truncate step when rerunning the pipeline.
I'll take this one so I can get the August reports run.
All fixed apart form mobile response bodies (still running).
These were the pages that needed fixing:
- httparchive.response_bodies.2022_08_01_mobile
- httparchive.response_bodies.2022_08_01_desktop
- httparchive.summary_requests.2022_08_01_mobile (desktop was fine)
- httparchive.pages.2022_08_01_desktop
- httparchive.pages.2022_08_01_mobile
- httparchive.requests.2022_08_01_desktop
- httparchive.requests.2022_08_01_mobile
- httparchive.technology.2022_08_01_desktop
- httparchive.technology.2022_08_01_mobile
- httparchive.lighthouse.2022_08_01_desktop
- httparchive.lighthouse.2022_08_01_mobile
Here's the steps I took for each table:
-- Check for duplicate rows:
select 'dedup', count(pageis) from `httparchive.pages.2022_08_01_mobile`
union all
select 'current', count(0) from `httparchive.pages.2022_08_01_mobile`;
-- Dedup using scratchspace
CREATE OR REPLACE TABLE `httparchive.scratchspace.pages_2022_08_01_mobile_dedup`
AS
SELECT DISTINCT * FROM `httparchive.pages.2022_08_01_mobile`
-- Confirm dedup created successfully and diffs:
select 'dedup', count(0) from `httparchive.scratchspace.pages_2022_08_01_mobile_dedup`
union all
select 'current', count(0) from `httparchive.pages.2022_08_01_mobile`;
-- truncate table (commented out - just in case!)
-- truncate table `httparchive.pages.2022_08_01_mobile`;
-- repopulate depuded data
insert into `httparchive.pages.2022_08_01_mobile`
select * from `httparchive.scratchspace.pages_2022_08_01_mobile_dedup`
-- Confirm no diffs:
select 'dedup', count(0) from `httparchive.scratchspace.pages_2022_08_01_mobile_dedup`
union all
select 'current', count(0) from `httparchive.pages.2022_08_01_mobile`;
-- drop scratch table
drop table `httparchive.scratchspace.pages_2022_08_01_mobile_dedup`@rviscomi / @giancarloaf are we just assuming this was manual error or is there anything else we want to investigate here? Or happy for me to close out this issue once I get mobile response bodies sorted too?
Thanks @tunetheweb! Before closing I think it would be good just to verify that we know how we ended up with duplicates. It might also be worth considering adding checks to the data pipeline to prevent it. I think we consciously switched from truncate to append write mode, so maybe we can't switch back, but we may still want to have a check to ensure that the table doesn't exist or is at least empty before starting the processing/appending. I'll reassign to @giancarloaf to answer/investigate that.
Confirmed this is safe to ignore after investigating some of the previous job history. This was a result of a combination of partial table loads from the streaming pipeline (no longer used) and a full batch job appending to the final table destination.
It's safe to close this issue. We should not expect duplicates like this in the future. If duplicates are found in future runs, they should be investigated separately.
Cool thanks for looking into it
