HTTPArchive/data-pipeline

Too many pages in `summary_requests`

rviscomi opened this issue · 1 comment

Ideally, every `pageid` in the `summary_pages` table would also appear in the `summary_requests` table, and no others.

However, it appears that there are significantly more pages represented in the requests table:

WITH pages AS (
  SELECT
    pageid
  FROM
    `httparchive.summary_pages.2022_09_01_mobile`
),

requests AS (
  SELECT
    pageid
  FROM
    `httparchive.summary_requests.2022_09_01_mobile`
)

SELECT
  COUNT(DISTINCT IF(pages.pageid IS NULL, requests.pageid, NULL)) AS extra_pageids
FROM
  requests
LEFT JOIN
  pages
USING
  (pageid)

The result is 13M extra pageids, which suggests that we're accidentally including secondary pages in the summary tables.

Working theory: the summary requests logic is not filtering out secondary pages because the requests object lacks a metadata attribute:
https://github.com/HTTPArchive/data-pipeline/blob/main/modules/utils.py#L217-L219

The metadata attribute can be added when generating the requests object:
https://github.com/HTTPArchive/data-pipeline/blob/main/modules/transformation.py#L216-L230
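
To make the theory concrete, here is a minimal sketch, not the actual pipeline code. It assumes the filter treats a row without a `metadata` attribute as a home page and that secondary pages are marked by a non-zero `crawl_depth`. The names `is_home_page`, `build_request`, and `crawl_depth` are illustrative assumptions and may not match what `utils.py` and `transformation.py` really use.

```python
import json


def is_home_page(row):
    """Return True when the row appears to belong to a home (root) page.

    Assumption: a row with no "metadata" attribute is treated as a home page,
    which is the suspected reason secondary-page requests slip through.
    """
    metadata = row.get("metadata")
    if not metadata:
        return True
    if isinstance(metadata, str):
        metadata = json.loads(metadata)
    # Assumption: crawl_depth 0 marks a home page, anything deeper is secondary.
    return metadata.get("crawl_depth", 0) == 0


def build_request(entry, page):
    """Hypothetical request builder that copies the page metadata onto the request."""
    return {
        "pageid": page["pageid"],
        "url": entry["url"],
        "metadata": page.get("metadata"),
    }


home = {"pageid": 1, "metadata": json.dumps({"crawl_depth": 0})}
secondary = {"pageid": 2, "metadata": json.dumps({"crawl_depth": 1})}
entry = {"url": "https://example.com/about"}

# Suspected current behaviour: no metadata on the request, so it passes the filter.
bare_request = {"pageid": secondary["pageid"], "url": entry["url"]}
leaks_through = is_home_page(bare_request)

# Proposed behaviour: metadata is attached, so the secondary-page request is dropped.
fixed_request = build_request(entry, secondary)
kept_after_fix = is_home_page(fixed_request)
```

Under these assumptions, `leaks_through` is `True` and `kept_after_fix` is `False`, which matches the extra pageids seen in the requests table.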

I took a copy of the existing table for confirmation: `httparchive.experimental_gc_summary_requests.2022_09_01_mobile_copy_issue_153`

@rviscomi would you mind deleting the unnecessary data in the meantime? (probably October too)