HTTPArchive/data-pipeline

Account for redirects when selecting secondary page URL

rviscomi opened this issue · 2 comments

The secondary page for the Web Almanac site is the home page itself:

SELECT
  is_root_page,
  page
FROM
  `httparchive.all.pages`
WHERE
  date = '2022-08-01'
  AND client = 'desktop'
  AND root_page = 'https://almanac.httparchive.org/'


The root page redirects to the 2021 edition, and the largest anchor seems to be the logo, which points back to the 2021 home page, so we seem to have a duplicate page with pre- and post-redirect URLs.

@pmeenan WDYT about this one?

It currently checks against the URL that the test originally navigated to. It should be trivial to modify the custom metric to also exclude links that match the current location. I think the change is contained entirely within the custom metric, since it concerns the final URL rather than the initial URL.

One thing we won't be able to detect, though, is other URLs that redirect back to the same URL (e.g. if /en/2021/ redirected to /).
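
For illustration, a minimal sketch of the exclusion described above, assuming the candidate anchors have already been collected with a size for each. This is not the actual custom metric code: `normalize`, `pickSecondaryPage`, the `area` field, and the `/en/2021/performance` link are hypothetical names chosen for the example.

```javascript
// Hypothetical sketch, not the real HTTP Archive custom metric.

// Resolve an href against a base and drop the fragment so that
// "/en/2021/#top" and "/en/2021/" compare as the same page.
function normalize(href, base) {
  try {
    const url = new URL(href, base);
    url.hash = '';
    return url.href;
  } catch (e) {
    return null; // unparseable href
  }
}

// anchors:    [{href, area}] candidate links; `area` is an assumed size signal
// testedUrl:  the URL the test originally navigated to (pre-redirect)
// currentUrl: the final URL after redirects (document.location.href in a page)
function pickSecondaryPage(anchors, testedUrl, currentUrl) {
  // Exclude both the pre-redirect and the post-redirect URL.
  const excluded = new Set([normalize(testedUrl), normalize(currentUrl)]);
  let best = null;
  for (const anchor of anchors) {
    const href = normalize(anchor.href, currentUrl);
    if (!href || excluded.has(href)) continue;
    if (!best || anchor.area > best.area) {
      best = { href, area: anchor.area };
    }
  }
  return best ? best.href : null;
}

// Example modeled on the Almanac case: the logo is the largest anchor
// but points back to the post-redirect home page, so it is skipped.
const secondary = pickSecondaryPage(
  [
    { href: '/en/2021/', area: 5000 },           // logo
    { href: '/en/2021/performance', area: 1200 } // illustrative link
  ],
  'https://almanac.httparchive.org/',
  'https://almanac.httparchive.org/en/2021/'
);
console.log(secondary);
```

The comparison here is purely string-based on the two known URLs, so a link to some third URL that redirects to the home page would still be picked; catching that would require actually following the link.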