HTTPArchive/data-pipeline

Secondary pages marked as root pages in parsed CSS table

rviscomi opened this issue · 1 comments

The experimental_parsed_css.2022_07_01_* tables include both home pages and secondary pages, and all of them have is_root_page set to true.

Secondary pages should have this field set to false.

Because of this bug, secondary pages are mistakenly included in the almanac.parsed_css table.

I'm fixing the existing tables in place with BigQuery DML:

-- Mark every page that is not listed in summary_pages as a secondary page.
UPDATE
  `httparchive.experimental_parsed_css.2022_07_01_mobile`
SET
  is_root_page = FALSE
WHERE
  page NOT IN (SELECT url FROM `httparchive.summary_pages.2022_07_01_mobile`)
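
A possible sanity check to run after the update, as a sketch only: it assumes, like the statement above, that summary_pages.2022_07_01_mobile lists only home pages and that its url column has no NULL values (a NULL in the subquery would make NOT IN match nothing). The mislabeled_rows alias is made up for illustration.

-- Count rows still flagged as root pages whose URL is not a known home page.
SELECT
  COUNT(0) AS mislabeled_rows
FROM
  `httparchive.experimental_parsed_css.2022_07_01_mobile`
WHERE
  is_root_page
  AND page NOT IN (SELECT url FROM `httparchive.summary_pages.2022_07_01_mobile`)

If the update worked as intended, this should return 0.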

The data pipeline also needs to be fixed so that it writes the correct is_root_page value in future crawls.