HTTPArchive/data-pipeline

Page summary reports contain duplicates

Themanwithoutaplan opened this issue · 4 comments

I only discovered this by chance and I'm not sure how many other instances there are, but at least in the 2022-06-01 desktop query the test 220609_Dx5G_1613T (www.jnj.com) appears twice, which shouldn't be the case.

Agreed, something doesn't look right. There are 3,252 duplicate WPT IDs in the 2022_06_01_desktop table.

It's straightforward to remove the duplicates but we should also try to understand how it happened. FWIW there are 0 duplicates in the July table, so this may be unique to the streaming pipeline.

2022-06 mobile - 3946 duplicates

bquxjob_14def594_1830fe5f0ad.csv

SELECT wptid, COUNT(1) as count_wptids, ARRAY_AGG(DISTINCT url) as urls, ARRAY_LENGTH(ARRAY_AGG(DISTINCT url)) as count_urls
FROM `httparchive.summary_pages.2022_06_01_mobile`
GROUP BY 1
HAVING COUNT(1) > 1
ORDER BY 2 DESC

2022-06 desktop - 3252 duplicates

bquxjob_16304447_1830fe8e23a.csv

SELECT wptid, COUNT(1) as count_wptids, ARRAY_AGG(DISTINCT url) as urls, ARRAY_LENGTH(ARRAY_AGG(DISTINCT url)) as count_urls
FROM `httparchive.summary_pages.2022_06_01_desktop`
GROUP BY 1
HAVING COUNT(1) > 1
ORDER BY 2 DESC

2022-05 desktop - 87 duplicates

bquxjob_60149832_1830ffc8781.csv

SELECT wptid, COUNT(1) as count_wptids, ARRAY_AGG(DISTINCT url) as urls, ARRAY_LENGTH(ARRAY_AGG(DISTINCT url)) as count_urls
FROM `httparchive.summary_pages.2022_05_01_desktop`
GROUP BY 1
HAVING COUNT(1) > 1
ORDER BY 2 DESC
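
The three queries above differ only in the table name, so the check could be templated rather than copied. This is a minimal sketch, not something used in this thread: the helper name `duplicate_check_query` and the table-name pattern are assumptions. It only builds the SQL text and does not call BigQuery.

```python
import re

# Same query as posted above; only the table name is a placeholder.
DUPLICATE_CHECK_SQL = "\n".join([
    "SELECT wptid, COUNT(1) as count_wptids, ARRAY_AGG(DISTINCT url) as urls, "
    "ARRAY_LENGTH(ARRAY_AGG(DISTINCT url)) as count_urls",
    "FROM `httparchive.summary_pages.{table}`",
    "GROUP BY 1",
    "HAVING COUNT(1) > 1",
    "ORDER BY 2 DESC",
])

# Assumed naming scheme, inferred from the tables named in this thread.
_TABLE_RE = re.compile(r"^\d{4}_\d{2}_\d{2}_(desktop|mobile)$")


def duplicate_check_query(table: str) -> str:
    """Return the duplicate-WPT-ID query for one summary_pages table."""
    if not _TABLE_RE.match(table):
        # Reject anything that does not look like YYYY_MM_DD_client,
        # since the name is interpolated directly into the SQL text.
        raise ValueError(f"unexpected table name: {table!r}")
    return DUPLICATE_CHECK_SQL.format(table=table)


if __name__ == "__main__":
    for name in ("2022_06_01_mobile", "2022_06_01_desktop", "2022_05_01_desktop"):
        print(duplicate_check_query(name), end="\n\n")
```

Validating the table name before formatting keeps the helper from producing arbitrary SQL if it is ever fed untrusted input.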

These older jobs are harder to research, but as with #133, this is likely due to issues encountered when implementing the streaming pipeline around that time. Since we no longer use the streaming approach to load data into BigQuery, we should not expect this to recur.

I have deduplicated the following tables, which were loaded around the time streaming was still in use:

  • httparchive.summary_pages.2022_06_01_mobile
  • httparchive.summary_pages.2022_06_01_desktop
  • httparchive.summary_pages.2022_05_01_desktop (mobile had no duplicates)
  • httparchive.summary_requests.2022_06_01_mobile
  • httparchive.summary_requests.2022_06_01_desktop
  • httparchive.summary_requests.2022_05_01_desktop
  • httparchive.summary_requests.2022_05_01_mobile
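
The thread does not say how the deduplication was performed. As an illustration of the general idea only, the sketch below keeps one row per `wptid` and deletes the rest, using an in-memory SQLite table in place of BigQuery. The table name `pages`, the helper `dedupe_by_wptid`, and the sample rows are assumptions made for the example.

```python
import sqlite3


def dedupe_by_wptid(conn: sqlite3.Connection, table: str = "pages") -> int:
    """Keep the first-inserted row per wptid; return the number of rows deleted."""
    # `table` is a trusted, hard-coded identifier in this sketch.
    cur = conn.execute(
        f"DELETE FROM {table} "
        f"WHERE rowid NOT IN (SELECT MIN(rowid) FROM {table} GROUP BY wptid)"
    )
    conn.commit()
    return cur.rowcount


# Illustrative data: one WPT ID loaded twice, plus one made-up unique ID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (wptid TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("220609_Dx5G_1613T", "https://www.jnj.com/"),
        ("220609_Dx5G_1613T", "https://www.jnj.com/"),
        ("hypothetical_id_2", "https://example.com/"),
    ],
)

removed = dedupe_by_wptid(conn)
remaining = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(f"removed={removed} remaining={remaining}")
```

Keying on `wptid` alone assumes every duplicate row for a test is interchangeable. The `count_urls` column in the queries above is one way to check that assumption before deleting anything.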

Great, thanks @giancarloaf. Looks like we can resolve this.