Too many pages in `summary_requests`
rviscomi opened this issue · 1 comments
Ideally, we would expect to see every `pageid` from the `summary_pages` table in the `summary_requests` table, and no others.
However, it appears that there are significantly more pages represented in the requests table:
WITH pages AS (
SELECT
pageid
FROM
`httparchive.summary_pages.2022_09_01_mobile`
),
requests AS (
SELECT
pageid
FROM
`httparchive.summary_requests.2022_09_01_mobile`
)
SELECT
COUNT(DISTINCT IF(pages.pageid IS NULL, requests.pageid, NULL)) AS extra_pageids
FROM
requests
LEFT JOIN
pages
USING
(pageid)

The result is 13M, which suggests that we're accidentally including secondary pages in the summary tables.
Working theory: the summary requests logic is not filtering out secondary pages because the request objects lack a metadata attribute:
https://github.com/HTTPArchive/data-pipeline/blob/main/modules/utils.py#L217-L219
The attribute can be added when generating the requests object:
https://github.com/HTTPArchive/data-pipeline/blob/main/modules/transformation.py#L216-L230
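
To illustrate the theory, here is a minimal sketch of how a missing metadata attribute could let secondary pages through. It is not the actual pipeline code: `is_home_page`, `build_request`, the `crawl_depth` key, and the sample row are all hypothetical names chosen for illustration, and the fallback-to-True behavior is an assumption about how such a filter might treat rows without metadata.

```python
import json


def is_home_page(element):
    """Hypothetical filter: True when the row belongs to a root (home) page."""
    metadata = element.get("metadata")
    if not metadata:
        # Assumption: rows without metadata are treated as legacy,
        # home-page-only crawl data, so they always pass the filter.
        return True
    if isinstance(metadata, str):
        metadata = json.loads(metadata)
    return metadata.get("crawl_depth", 0) == 0


def build_request(har_entry, page_metadata=None):
    """Hypothetical request builder; passing page_metadata is the proposed fix."""
    request = {"pageid": har_entry["pageid"], "url": har_entry["url"]}
    if page_metadata is not None:
        request["metadata"] = page_metadata
    return request


# A request that belongs to a secondary page (crawl depth 1).
secondary = {"pageid": 2, "url": "https://example.com/about/app.js"}
meta = {"crawl_depth": 1}

without_metadata = build_request(secondary)
with_metadata = build_request(secondary, meta)

print(is_home_page(without_metadata))  # True: the secondary page slips through
print(is_home_page(with_metadata))     # False: the secondary page is filtered out
```

If the real filter behaves like this sketch, attaching the page's metadata to each request at generation time should be enough for the existing check to drop secondary pages.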
Took a copy of the existing table for confirmation: `httparchive.experimental_gc_summary_requests.2022_09_01_mobile_copy_issue_153`
@rviscomi would you mind deleting the unnecessary data in the meantime? (probably October too)