Reorganize the BigQuery datasets to be more efficient
Similar to the almanac dataset, we want the monthly results to be queryable in partitioned and clustered tables.
We'll need a deprecation plan to avoid suddenly breaking existing queries:
- create the new tables
- backfill with existing data
- alter the pipeline to write to the new tables in addition to the existing tables #38
- document how to query the new tables
- announce the deprecation plan
- migrate existing queries to the new tables (e.g. for generate_reports)
- wait N months to allow the community to upgrade their queries
- stop writing to the old tables in the pipeline
- delete the old tables
Brainstorming new tables/schemas.
- httparchive.all.pages
- httparchive.all.requests
httparchive.all.pages
All-in-one table containing all page-based results from all crawls for all clients.
Partition by:
- date (required)
Cluster by:
- client
- is_root_page
- rank
Combines data from multiple existing tables:
- pages
- summary_pages
- blink_features
- technologies
- lighthouse
Additional enhancements:
- New custom_metrics field
- features and technologies fields are well-defined repeated structs
- Combining categories into a single repeated STRING field, rather than creating new technology+category pairs, which tripped up many users (including myself)
- summary field is JSON-encoded so we can support all historical summary_pages stats if we'd like, and add/remove anything over time without breaking the schema
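For illustration, here's a minimal DDL sketch of how this table could be declared, combining the partitioning/clustering above with the schema below (a sketch only, not the final pipeline DDL):
CREATE TABLE `httparchive.all.pages`
(
  date DATE NOT NULL,
  client STRING NOT NULL,
  page STRING NOT NULL,
  is_root_page BOOL NOT NULL,
  root_page STRING NOT NULL,
  rank INT64,
  wptid STRING,
  payload STRING,
  summary STRING,
  custom_metrics STRING,
  lighthouse STRING,
  features ARRAY<STRUCT<feature STRING, id STRING, type STRING>>,
  technologies ARRAY<STRUCT<technology STRING, categories ARRAY<STRING>, info ARRAY<STRING>>>,
  metadata STRING
)
PARTITION BY date
CLUSTER BY client, is_root_page, rank
# "date (required)" maps to require_partition_filter: queries must filter on date
OPTIONS (require_partition_filter = TRUE);
The require_partition_filter option is what makes the date partition "required": queries that omit a date filter fail rather than accidentally scanning every crawl.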
Schema
[
{
"name": "date",
"type": "DATE",
"mode": "REQUIRED",
"description": "YYYY-MM-DD format of the HTTP Archive monthly crawl"
},
{
"name": "client",
"type": "STRING",
"mode": "REQUIRED",
"description": "Test environment: desktop or mobile"
},
{
"name": "page",
"type": "STRING",
"mode": "REQUIRED",
"description": "The URL of the page being tested"
},
{
"name": "is_root_page",
"type": "BOOLEAN",
"mode": "REQUIRED",
"description": "Whether the page is the root of the origin"
},
{
"name": "root_page",
"type": "STRING",
"mode": "REQUIRED",
"description": "The URL of the root page being tested, the origin followed by /"
},
{
"name": "rank",
"type": "INTEGER",
"mode": "NULLABLE",
"description": "Site popularity rank, from CrUX"
},
{
"name": "wptid",
"type": "STRING",
"mode": "NULLABLE",
"description": "ID of the WebPageTest results"
},
{
"name": "payload",
"type": "STRING",
"mode": "NULLABLE",
"description": "JSON-encoded WebPageTest results for the page"
},
{
"name": "summary",
"type": "STRING",
"mode": "NULLABLE",
"description": "JSON-encoded summarization of the page-level data"
},
{
"name": "custom_metrics",
"type": "STRING",
"mode": "NULLABLE",
"description": "JSON-encoded test results of the custom metrics"
},
{
"name": "lighthouse",
"type": "STRING",
"mode": "NULLABLE",
"description": "JSON-encoded Lighthouse report"
},
{
"name": "features",
"type": "RECORD",
"mode": "REPEATED",
"description": "Blink features detected at runtime (see https://chromestatus.com/features)",
"fields": [
{
"name": "feature",
"type": "STRING",
"mode": "NULLABLE",
"description": "Blink feature name"
},
{
"name": "id",
"type": "STRING",
"mode": "NULLABLE",
"description": "Blink feature ID"
},
{
"name": "type",
"type": "STRING",
"mode": "NULLABLE",
"description": "Blink feature type (css, default)"
}
]
},
{
"name": "technologies",
"type": "RECORD",
"mode": "REPEATED",
"description": "Technologies detected at runtime (see https://www.wappalyzer.com/)",
"fields": [
{
"name": "technology",
"type": "STRING",
"mode": "NULLABLE",
"description": "Name of the detected technology"
},
{
"name": "categories",
"type": "STRING",
"mode": "REPEATED",
"description": "List of categories to which this technology belongs"
},
{
"name": "info",
"type": "STRING",
"mode": "REPEATED",
"description": "Additional metadata about the detected technology, ie version number"
}
]
},
{
"name": "metadata",
"type": "STRING",
"mode": "NULLABLE",
"description": "Additional metadata about the test"
}
]
Example queries
Median page weight over time
SELECT
# No need to parse the date or client from _TABLE_SUFFIX
date,
client,
# Existing summary stats can be extracted from the JSON-encoded `summary` field
APPROX_QUANTILES(CAST(JSON_VALUE(summary, '$.bytesTotal') AS INT64), 1000)[OFFSET(500)] AS median_page_weight
FROM
# New table address
`httparchive.all.pages`
WHERE
# The table is partitioned by date, so we don't incur any costs for data older than 2020
date >= '2020-01-01' AND
# Only measure root/home pages for consistency
is_root_page
GROUP BY
date,
client
ORDER BY
date,
client
httparchive.all.requests
All-in-one table containing all request-based results for all crawls and clients.
Partition by:
- date (required)
Cluster by:
- client
- is_root_page
- is_main_document
- type
Combines data from multiple existing tables:
- requests
- summary_requests
Additional enhancements:
- Top-level request_headers and response_headers fields that are well-defined repeated structs of key/value pairs, borrowing from the almanac.summary_response_bodies table
- summary field is JSON-encoded so we can support all historical summary_requests stats if we'd like, and add/remove anything over time without breaking the schema
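A corresponding DDL sketch for this table, under the same assumptions as the pages sketch above. Note that all four available clustering slots are used here, which comes up again later in the thread:
CREATE TABLE `httparchive.all.requests`
(
  date DATE NOT NULL,
  client STRING NOT NULL,
  page STRING NOT NULL,
  is_root_page BOOL,
  root_page STRING NOT NULL,
  url STRING NOT NULL,
  is_main_document BOOL NOT NULL,
  type STRING,
  index INT64,
  payload STRING,
  summary STRING,
  request_headers ARRAY<STRUCT<name STRING, value STRING>>,
  response_headers ARRAY<STRUCT<name STRING, value STRING>>,
  response_body STRING
)
PARTITION BY date
# BigQuery allows at most four clustering columns, so this is the full budget
CLUSTER BY client, is_root_page, is_main_document, type
OPTIONS (require_partition_filter = TRUE);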
Schema
[
{
"name": "date",
"type": "DATE",
"mode": "REQUIRED",
"description": "YYYY-MM-DD format of the HTTP Archive monthly crawl"
},
{
"name": "client",
"type": "STRING",
"mode": "REQUIRED",
"description": "Test environment: desktop or mobile"
},
{
"name": "page",
"type": "STRING",
"mode": "REQUIRED",
"description": "The URL of the page being tested"
},
{
"name": "is_root_page",
"type": "BOOLEAN",
"mode": "NULLABLE",
"description": "Whether the page is the root of the origin."
},
{
"name": "root_page",
"type": "STRING",
"mode": "REQUIRED",
"description": "The URL of the root page being tested"
},
{
"name": "url",
"type": "STRING",
"mode": "REQUIRED",
"description": "The URL of the request"
},
{
"name": "is_main_document",
"type": "BOOLEAN",
"mode": "REQUIRED",
"description": "Whether this request corresponds with the main HTML document of the page, which is the first HTML request after redirects"
},
{
"name": "type",
"type": "STRING",
"mode": "NULLABLE",
"description": "Simplified description of the type of resource (script, html, css, text, other, etc)"
},
{
"name": "index",
"type": "INTEGER",
"mode": "NULLABLE",
"description": "The sequential 0-based index of the request"
},
{
"name": "payload",
"type": "STRING",
"mode": "NULLABLE",
"description": "JSON-encoded WebPageTest result data for this request"
},
{
"name": "summary",
"type": "STRING",
"mode": "NULLABLE",
"description": "JSON-encoded summarization of request data"
},
{
"name": "request_headers",
"type": "RECORD",
"mode": "REPEATED",
"description": "Request headers",
"fields": [
{
"name": "name",
"type": "STRING",
"mode": "NULLABLE",
"description": "Request header name"
},
{
"name": "value",
"type": "STRING",
"mode": "NULLABLE",
"description": "Request header value"
}
]
},
{
"name": "response_headers",
"type": "RECORD",
"mode": "REPEATED",
"description": "Response headers",
"fields": [
{
"name": "name",
"type": "STRING",
"mode": "NULLABLE",
"description": "Response header name"
},
{
"name": "value",
"type": "STRING",
"mode": "NULLABLE",
"description": "Response header value"
}
]
},
{
"name": "response_body",
"type": "STRING",
"mode": "NULLABLE",
"description": "Text-based response body"
}
]
Example queries
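For instance, the repeated header structs turn header lookups into a simple UNNEST. A hedged sketch against the proposed schema above, counting pages whose main document was served with Brotli:
SELECT
  client,
  COUNT(DISTINCT page) AS pages
FROM
  `httparchive.all.requests`,
  # Flatten the repeated response_headers struct into one row per header
  UNNEST(response_headers) AS header
WHERE
  date = '2022-05-01' AND
  is_main_document AND
  LOWER(header.name) = 'content-encoding' AND
  header.value = 'br'
GROUP BY
  client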
Generating May 2022 data for httparchive.all.pages:
CREATE TEMP FUNCTION GET_CUSTOM_METRICS(payload STRING) RETURNS STRING LANGUAGE js AS '''
const $ = JSON.parse(payload);
return JSON.stringify(Object.fromEntries($._custom.map(name => {
let value = $[`_${name}`];
if (typeof value == 'string') {
try {
value = JSON.parse(value);
} catch (e) {
// The value is not a JSON string.
}
}
return [name, value];
})));
''';
CREATE TEMP FUNCTION GET_FEATURES(payload STRING)
RETURNS ARRAY<STRUCT<feature STRING, id STRING, type STRING>> LANGUAGE js AS
'''
function getFeatureNames(featureMap, featureType) {
try {
return Object.entries(featureMap).map(([key, value]) => {
// After Feb 2020 keys are feature IDs.
if (value.name) {
return {'feature': value.name, 'type': featureType, 'id': key};
}
// Prior to Feb 2020 keys fell back to IDs if the name was unknown.
if (idPattern.test(key)) {
return {'feature': '', 'type': featureType, 'id': key.match(idPattern)[1]};
}
// Prior to Feb 2020 keys were names by default.
return {'feature': key, 'type': featureType, 'id': ''};
});
} catch (e) {
return [];
}
}
var $ = JSON.parse(payload);
if (!$._blinkFeatureFirstUsed) return [];
var idPattern = new RegExp('^Feature_(\\d+)$');
return getFeatureNames($._blinkFeatureFirstUsed.Features, 'default')
.concat(getFeatureNames($._blinkFeatureFirstUsed.CSSFeatures, 'css'))
.concat(getFeatureNames($._blinkFeatureFirstUsed.AnimatedCSSFeatures, 'animated-css'));
''';
WITH pages AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
JSON_VALUE(payload, '$._metadata.rank') AS rank,
JSON_VALUE(payload, '$._metadata.crawl_depth') = '0' AS is_root_page,
JSON_VALUE(payload, '$._testID') AS wptid,
GET_CUSTOM_METRICS(payload) AS custom_metrics,
JSON_QUERY(payload, '$._metadata') AS metadata,
payload
FROM
`httparchive.pages.2022_05_01_*`
# TODO: Backfill when the summary_pages data is ready.
/* ), summary_pages AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
TO_JSON_STRING(summary_pages) AS summary
FROM
`httparchive.summary_pages.2022_05_01_*` AS summary_pages */
), loose_technologies AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
STRUCT(
app AS technology,
ARRAY_AGG(DISTINCT category ORDER BY category) AS categories,
ARRAY_AGG(info) AS info
) AS technology
FROM
`httparchive.technologies.2022_05_01_*`
GROUP BY
client,
page,
app
), techs AS (
SELECT
client,
page,
ARRAY_AGG(technology) AS technologies
FROM
loose_technologies
GROUP BY
client,
page
), lh AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
report AS lighthouse
FROM
`httparchive.lighthouse.2022_05_01_*`
)
SELECT
DATE('2022-05-01') AS date,
client,
page,
is_root_page,
rank,
wptid,
payload,
# TODO: Update when the summary pipeline completes successfully.
'' AS summary,
custom_metrics,
lighthouse,
GET_FEATURES(payload) AS features,
technologies,
metadata
FROM
pages
LEFT JOIN
techs
USING
(client, page)
LEFT JOIN
lh
USING
(client, page)
Pages data for April 2022. Changed rank to an INT64 field and added the required root_page field.
CREATE TEMP FUNCTION GET_CUSTOM_METRICS(payload STRING) RETURNS STRING LANGUAGE js AS '''
const $ = JSON.parse(payload);
return JSON.stringify(Object.fromEntries($._custom.map(name => {
let value = $[`_${name}`];
if (typeof value == 'string') {
try {
value = JSON.parse(value);
} catch (e) {
// The value is not a JSON string.
}
}
return [name, value];
})));
''';
CREATE TEMP FUNCTION GET_FEATURES(payload STRING)
RETURNS ARRAY<STRUCT<feature STRING, id STRING, type STRING>> LANGUAGE js AS
'''
function getFeatureNames(featureMap, featureType) {
try {
return Object.entries(featureMap).map(([key, value]) => {
// After Feb 2020 keys are feature IDs.
if (value.name) {
return {'feature': value.name, 'type': featureType, 'id': key};
}
// Prior to Feb 2020 keys fell back to IDs if the name was unknown.
if (idPattern.test(key)) {
return {'feature': '', 'type': featureType, 'id': key.match(idPattern)[1]};
}
// Prior to Feb 2020 keys were names by default.
return {'feature': key, 'type': featureType, 'id': ''};
});
} catch (e) {
return [];
}
}
var $ = JSON.parse(payload);
if (!$._blinkFeatureFirstUsed) return [];
var idPattern = new RegExp('^Feature_(\\d+)$');
return getFeatureNames($._blinkFeatureFirstUsed.Features, 'default')
.concat(getFeatureNames($._blinkFeatureFirstUsed.CSSFeatures, 'css'))
.concat(getFeatureNames($._blinkFeatureFirstUsed.AnimatedCSSFeatures, 'animated-css'));
''';
WITH pages AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
SAFE_CAST(JSON_VALUE(payload, '$._metadata.rank') AS INT64) AS rank,
COALESCE(JSON_VALUE(payload, '$._metadata.crawl_depth') = '0', TRUE) AS is_root_page,
COALESCE(JSON_VALUE(payload, '$._metadata.root_page_url'), url) AS root_page,
JSON_VALUE(payload, '$._testID') AS wptid,
GET_CUSTOM_METRICS(payload) AS custom_metrics,
JSON_QUERY(payload, '$._metadata') AS metadata,
payload
FROM
`httparchive.pages.2022_04_01_*`
), summary_pages AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
rank,
TO_JSON_STRING(summary_pages) AS summary
FROM
`httparchive.summary_pages.2022_04_01_*` AS summary_pages
), loose_technologies AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
STRUCT(
app AS technology,
ARRAY_AGG(DISTINCT category ORDER BY category) AS categories,
ARRAY_AGG(info) AS info
) AS technology
FROM
`httparchive.technologies.2022_04_01_*`
GROUP BY
client,
page,
app
), techs AS (
SELECT
client,
page,
ARRAY_AGG(technology) AS technologies
FROM
loose_technologies
GROUP BY
client,
page
), lh AS (
SELECT
_TABLE_SUFFIX AS client,
url AS page,
report AS lighthouse
FROM
`httparchive.lighthouse.2022_04_01_*`
)
SELECT
DATE('2022-04-01') AS date,
client,
page,
is_root_page,
root_page,
COALESCE(pages.rank, summary_pages.rank) AS rank,
wptid,
payload,
summary,
custom_metrics,
lighthouse,
GET_FEATURES(payload) AS features,
technologies,
metadata
FROM
pages
LEFT JOIN
summary_pages
USING
(client, page)
LEFT JOIN
techs
USING
(client, page)
LEFT JOIN
lh
USING
(client, page)
Ran into OOM issues with generating the all.requests table directly in BQ.
Since we'll eventually need to generate tables in the new all dataset from Dataflow, I decided to prototype how that pipeline would work. See HTTPArchive/bigquery#170. Summary data is omitted for now since we'll be merging pipelines soon anyway.
I've successfully tested it on a single HAR and now attempting a full-scale test on the entire 2022_05_12 crawl, both desktop and mobile concurrently. I expect it to take 6-8 hours (running for ~2 so far).
The all.requests table is clustered by the following columns:
- client
- is_root_page
- is_main_document
- type
As well as implicitly by the date partitioning on this table.
We're only allowed 4 clustering fields, meaning we cannot add any more. However, I'm wondering if is_main_document is really that useful. It might be much more beneficial to cluster on page instead, to allow us to cheaply join to the all.pages table. We could still filter for is_main_document relatively cheaply by first filtering on the html type and then on is_main_document (appreciating that not ALL main documents will be of type HTML, but most should be). Clustering on page would also allow us to filter based on rank and technologies (again, both via a join), and similarly on other fields.
WDYT?
is_main_document is equivalent to the old firstHtml field, so having that parity might make migrating to the new schema easier. But I'm happy to explore other faster/cheaper schemas. Would you be willing to create a temp table with the proposed schema and benchmark a few test queries against each?
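For example, one crawl could be copied into a table with the alternative clustering and the same test queries compared against both. A rough sketch, where the scratch dataset name is hypothetical:
CREATE TABLE `httparchive.scratch.requests_clustered_by_page`
PARTITION BY date
CLUSTER BY client, is_root_page, page, type
AS
SELECT *
FROM `httparchive.all.requests`
# One crawl's partition should be enough for benchmarking
WHERE date = '2022-05-12';
Running the same queries against each table with --dry_run (or comparing bytes billed after actually running them) would show whether clustering by page pays off.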
If I query for
SELECT
JSON_VALUE(lighthouse, '$.finalUrl') AS final_url,
FROM `httparchive.all.pages`
WHERE
date = "2022-10-01"
AND client = 'mobile'
AND is_root_page
The BQ UI warns me I'm going to be querying 13.65 TB, vs 4.63 TB for the (currently) equivalent
SELECT
JSON_VALUE(report, '$.finalUrl') AS final_url,
FROM `httparchive.lighthouse.2022_10_01_mobile`
Is there some partitioning/clustering trick I should be doing to bring the all number more in line with the lighthouse number? Or more that could be done to the table structure? That would be an unfortunate cost increase for querying the raw data.
The old BQ estimation behavior was to not take clustered fields into consideration. I'm also seeing the new behavior showing the 4.67 TB estimate, so maybe they've only partially rolled it out.
via the CLI:
bq query --nouse_legacy_sql --dry_run 'SELECT
JSON_VALUE(lighthouse, "$.finalUrl") AS final_url,
FROM `httparchive.all.pages`
WHERE
date = "2022-10-01"
AND client = "mobile"
AND is_root_page'
also gives me "this query will process upper bound of 15012318480052 bytes of data" (aka 13.65 TiB). Can either of you try it with a secondary cloud project to see if you get the same numbers?
If everyone eventually gets the lower number, obviously this isn't an actual issue.
I see 13.65TB when using a random account. Hmm....
If you run the query it should still only process 4.67 TiB
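To confirm what a finished query actually processed (rather than the dry-run upper bound), the job statistics can be pulled from the CLI; a sketch, substituting a real job ID:
bq show --format=prettyjson -j <job_id> | grep totalBytesProcessed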
Well, here is the current plan for this reorganization: https://docs.google.com/document/d/1kNCpVgvC0W77HJzXiNvlchuGjCCY6uymIfv0nH10tF0/edit
Closing as most of the steps were implemented.
We have planned legacy data access deprecation (and cleanup) in March 2025.
Migration of the reports pipeline is pending: HTTPArchive/httparchive.org#938

