HTTPArchive/data-pipeline

Crawlid to be continued after EOL of batching

Themanwithoutaplan opened this issue · 5 comments

For any kind of historical report it's useful to be able to use the crawlid, so it would be good to have it continue to be incremented for the relevant summary reports. There is no other candidate key within the data (report label, browser type).

@Themanwithoutaplan what is it required for?

As I said, for any kind of historical report.

I think the date of the crawl would be a more meaningful dimension for historical analysis (browser type is also available as client), especially when you want to put your metrics on a timescale.

Example:

-- Query estimated at 139 GB
WITH pages AS (
  SELECT
    date,
    client,
    CAST(JSON_VALUE(summary, '$.bytesTotal') AS INT64) AS page_weight
  FROM
    `httparchive.all.pages` TABLESAMPLE SYSTEM (10 PERCENT)
  WHERE
    date >= '2023-09-01' AND
    is_root_page
)

SELECT
  date,
  client,
  APPROX_QUANTILES(page_weight, 1000)[OFFSET(500)] AS median_page_weight
FROM
  pages
GROUP BY
  date,
  client
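
A note on the query above: TABLESAMPLE SYSTEM (10 PERCENT) reads only roughly a tenth of the table's data blocks to keep the scanned bytes down, and APPROX_QUANTILES(page_weight, 1000)[OFFSET(500)] returns an approximate median, which is usually sufficient for trend reporting.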

@Themanwithoutaplan do the example and the date+client+page key satisfy your analysis requirements?
If not, please share more details.

Closing, as there are alternatives for identifying unique crawls across the historical data.
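
For reference, a minimal sketch of one such alternative, assuming the combined `httparchive.all.pages` table where each crawl corresponds to a single date+client pair:

-- Sketch: enumerate unique crawls by (date, client) instead of a crawlid,
-- counting the distinct root pages tested in each crawl.
SELECT
  date,
  client,
  COUNT(DISTINCT page) AS root_pages
FROM
  `httparchive.all.pages`
WHERE
  date >= '2023-09-01' AND
  is_root_page
GROUP BY
  date,
  client
ORDER BY
  date,
  client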