The new schema and cost concerns for users
tunetheweb opened this issue · 6 comments
So we have a new schema in the `all` dataset which basically has two tables: `pages` and `requests`.
Previously we had separate schemas for each data type (pages, requests, lighthouse, response bodies, technologies) and also summary tables (`summary_pages`, `summary_requests`).
There is a LOT to like about the new schema, including:
- Easier to query across dates without having to use `_TABLE_SUFFIX`
- No need to join, as everything is in one place
- Contains secondary pages
And there are some good cost benefits:
- The `pages` table is partitioned on `date` and clustered on `client`, `is_root_page`, and `rank`. This means, for these columns, you only pay for the rows you query. This saves time and real dollars. In the old schema you paid the full amount for every table you queried, even if you were only getting a few rows (see the sketch after this list).
- The `requests` table is partitioned on `date` and clustered on `client`, `is_root_page`, `is_main_document`, and `type`, with the same benefits as above.
- You MUST supply a date `WHERE` clause. This prevents querying the whole table, which could be VERY expensive. The query will not run without a date. You can, however, use a date range (e.g. `WHERE date > '1900-01-01'`) for trend queries - though you shouldn't do that for really expensive queries.
- A number of pieces of data are available in each of the two tables, avoiding you having to query the expensive `payload` column for common things. This was basically available before with the `summary` tables, but has been greatly enhanced. For pages we now pull out rank, all the previous summary data, and custom metrics into their own columns. This can also be enhanced in future to pull out more things (e.g. page-level CrUX data, Lighthouse scores, or particular audits). For requests we pull out type, summary data as before, and JSON-based request headers and response headers.
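For example, here's a minimal sketch of a cheap query that leans on the partition (`date`) and cluster (`client`, `is_root_page`, `rank`) columns, so BigQuery only scans the matching blocks and the narrow columns selected:

```sql
-- Minimal sketch: filtering on the partition and cluster columns, and
-- selecting only narrow columns, keeps the scanned (and billed) bytes small.
SELECT
  page,
  rank
FROM
  `httparchive.all.pages`
WHERE
  date = '2022-10-01' AND  -- required partition filter
  client = 'mobile' AND    -- cluster column
  is_root_page AND         -- cluster column
  rank <= 1000             -- cluster column
```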
So we want to migrate people to the new schema as, in general, it's easier to use, and costs less (in dollars AND time).
However, I have one concern with the new schema and everything being in the one table, as opposed to split out as before: it makes the (VERY!) expensive `payload`, `lighthouse` and `response_bodies` data much easier to query accidentally, and I don't think that's a good default.
Coming from a more traditional RDBMS background, it's quite common in my experience to run a `SELECT *` on a table to view it, and I worry BigQuery newbies could do this and end up with HUGE bills. BigQuery does have the Schema and Preview options on tables, and these are much better than using `SELECT *`, but, as I say, not everyone is an experienced BigQuery user.
We can (and have) insisted on a partition field (`date`), but BigQuery does not allow us to insist on columns being explicitly given, so we cannot prevent people from running the above sort of `SELECT *` queries.
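Whichever option we pick, we can at least encourage people to check what their queries actually scan. A hedged sketch using `INFORMATION_SCHEMA` (assuming jobs run in the US region) to audit recent scan sizes:

```sql
-- Sketch: list your most expensive queries from the last day.
-- Assumes the US region; swap `region-us` for your own region.
SELECT
  job_id,
  total_bytes_processed / POW(1024, 3) AS gb_processed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY
  total_bytes_processed DESC
```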
We have a number of options to address this concern of mine:
- Do nothing. There was always the risk of someone doing a large `SELECT` even in the old schema, and in many ways it's better now, even if in some ways maybe it's worse.
- Remove the expensive columns from the `all.pages` and `all.requests` tables and switch back to having separate `payload`, `lighthouse` and `response_bodies` tables. This then means either 1) extra joins or 2) duplicating data, so loses some of the benefits of the new schema. It would also be a change to our pipeline.
- Create smaller, summary Views on the tables within the `all` schema (e.g. `all.summary_pages`, `all.summary_requests`), without those expensive columns, and encourage their use over the `all.pages` and `all.requests` tables, especially for beginners. But is this more confusing, and will it lead to people writing the same query in two different ways?
- Create smaller, broken-up Views in a separate schema (e.g. a `httparchive.query` schema with broken-down tables: `httparchive.query.summary_pages`, `httparchive.query.pages`, `httparchive.query.lighthouse`, `httparchive.query.summary_requests`, `httparchive.query.response_bodies`) and encourage their use, with people free to use the `all` schema if they need to, to avoid joins. We could even not bother with the `response_bodies` table in this basic schema and say that's only for experts. But is this more confusing, and will it lead to people writing the same query in two different ways?
The Views can be created once and will automatically update so I don't think maintenance is an issue.
For example, in the `latest` schema we currently have three new views that automatically look at the latest month's data and also at a subset of the data:
```sql
CREATE OR REPLACE VIEW `httparchive.latest.pages` AS (
  SELECT
    * EXCEPT (lighthouse)
  FROM
    `httparchive.all.pages`
  WHERE
    -- latest date only (this part not shown for brevity)
)
```

```sql
CREATE OR REPLACE VIEW `httparchive.latest.lighthouse` AS (
  SELECT
    date,
    client,
    page,
    is_root_page,
    root_page,
    rank,
    wptid,
    lighthouse AS report,
    metadata
  FROM
    `httparchive.all.pages`
  WHERE
    -- latest date only (this part not shown for brevity)
)
```

```sql
CREATE OR REPLACE VIEW `httparchive.latest.requests` AS (
  SELECT
    * EXCEPT (response_body)
  FROM
    `httparchive.all.requests`
  WHERE
    -- latest date only (this part not shown for brevity)
)
```

I'd be interested to hear views on this (@rviscomi I know you have some, as we've discussed), and whether we need to do anything about this?
Let's use this issue to explore ways to minimize costs related to the `summary` field of the pages/requests datasets. Some ideas:
- Remove redundant fields from `summary` (url, rank, metadata, etc.)
- Split out groups of related useful summary metrics (# requests, bytes per resource type)
- Remove obsolete fields (adult, etc.)
Important to note that there are two summary fields - one for pages, one for requests.
pages summary column
Remove these fields as they're covered by other columns (e.g. `metadata`):
- "metadata": "{"rank": 10000, "...
- "pageid": 18058825,
- "createDate": 1658295068,
- "startedDateTime": 1656908528,
- "archive": "All",
- "label": "Jul 1 2022",
- "crawlid": 0,
- "url": "https://www.example.com.com/page/",
- "urlhash": 4963,
- "urlShort": "https://www.example.com/page/",
- "wptid": "220701_Mx0_2HGl0",
- "wptrun": 1,
- "rank": 10000
Remove these as they don't seem to be used anymore, and are probably better covered by custom metrics:
- "PageSpeed": null,
- "_adult_site": false,
- "avg_dom_depth": 0,
- "doctype": null,
- "document_height": 0,
- "document_width": 0,
- "localstorage_size": 0,
- "sessionstorage_size": 0,
- "meta_viewport": null,
- "num_iframes": null,
- "num_scripts": null,
- "num_scripts_sync": null,
- "num_scripts_async": null,
- "usertiming": null,
There's an argument to remove more, but I think those are easy, non-controversial wins that will halve the size of this column (at least).
I also wonder if the features and technologies should be JSON rather than arrays? Selecting two of these (e.g. websites that use WordPress AND jQuery) is more painful with arrays (requires a join AFAIK?). But on the other hand, unnesting with JSON requires a JavaScript function AFAIK, so pluses and minuses. However, it does seem a little inconsistent to use JSON in some places and arrays in others, unless we have a good reason?
requests summary column
Remove these as they're unlikely to be used, and can be fetched from `payload` if really needed:
- "requestid": 107588217800228970,
- "pageid": 25049834,
- "crawlid": 0,
- "startedDateTime": 1656980103,
- "url": "https://sideline2u.com/wp-content/plugins/wc-frontend-manager/assets/sounds/notification.mp3",
- "urlShort": "https://sideline2u.com/wp-content/plugins/wc-frontend-manager/assets/sounds/notification.mp3",
- "firstReq": true,
- "firstHtml": true
Remove these as they're covered by `request_headers` or `response_headers`:
- "reqOtherHeaders": "range = bytes=0-5310, sec-ch-ua = " Not A;...",
- "respOtherHeaders": "alt-svc = h3=":443"; ma=86400, ...2",
- "req_accept": "/",
- "req_accept_encoding": "identity;q=1, *;q=0",
- "req_accept_language": "en-US,en;q=0.9",
- "req_if_modified_since": "Tue, 01 Feb 2022 17:08:04 GMT",
- "req_if_none_match": ""61f968f4-2262"",
- "req_referer": "https://example.com/...",
- "req_user_agent": "Mozilla/5.0 (Linux; A...",
- "resp_age": "6123",
- "resp_cache_control": "max-age=300",
- "resp_date": "Tue, 05 Jul 2022 00:15:04 GMT",
- "resp_etag": ""61f968f4-2262"",
- "resp_last_modified": "Tue, 01 Feb 2022 17:08:04 GMT",
- "resp_server": "cloudflare",
- "resp_vary": "Accept-Encoding",
- "resp_content_length": "8802",
- "resp_content_type": "audio/mpeg",
Similar to the comment above, I wonder if we should use JSON rather than arrays for the response headers and request headers?
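To make the trade-off concrete, here's a hedged sketch, assuming today's `response_headers` is an array of `STRUCT<name, value>` pairs and the JSON alternative would be an object keyed by header name:

```sql
-- Array form (assumed STRUCT<name, value>): fetching a single header
-- needs a subquery over UNNEST.
SELECT
  url,
  (SELECT value
   FROM UNNEST(response_headers)
   WHERE LOWER(name) = 'content-type'
   LIMIT 1) AS content_type
FROM
  `httparchive.all.requests`
WHERE
  date = '2022-10-01' AND
  client = 'mobile' AND
  is_main_document

-- Hypothetical JSON form: the same lookup would be a simple path expression:
--   JSON_VALUE(response_headers, '$."content-type"') AS content_type
```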
> I also wonder if the features and technologies should be JSON rather than arrays? Selecting two of these (e.g. websites that use WordPress AND jQuery) is more painful with arrays (requires a join AFAIK?). But on the other hand, unnesting with JSON requires a JavaScript function AFAIK, so pluses and minuses. However, it does seem a little inconsistent to use JSON in some places and arrays in others, unless we have a good reason?
Might not be the best solution, but the WordPress/jQuery example is possible without joins:
```sql
WITH pages AS (
  SELECT
    page,
    ARRAY_AGG(t.technology) AS technologies
  FROM
    `httparchive.all.pages` TABLESAMPLE SYSTEM (0.01 PERCENT),
    UNNEST(technologies) AS t
  WHERE
    date = '2022-10-01' AND
    client = 'mobile'
  GROUP BY
    page
)

SELECT
  page
FROM
  pages
WHERE
  'WordPress' IN UNNEST(technologies) AND
  'jQuery' IN UNNEST(technologies)
```

Also possible to process a JSON-encoded array of technologies without a UDF:
```sql
WITH json AS (
  SELECT
    page,
    TO_JSON(technologies) AS technologies
  FROM
    `httparchive.all.pages` TABLESAMPLE SYSTEM (0.01 PERCENT)
  WHERE
    date = '2022-10-01' AND
    client = 'mobile'
),

pages AS (
  SELECT
    page,
    ARRAY_AGG(JSON_VALUE(t, '$.technology')) AS technologies
  FROM
    json,
    UNNEST(JSON_QUERY_ARRAY(technologies, '$')) AS t
  GROUP BY
    page
)

SELECT
  page
FROM
  pages
WHERE
  'WordPress' IN UNNEST(technologies) AND
  'jQuery' IN UNNEST(technologies)
```

IMO using arrays where possible is more semantic and avoids unnecessary decoding steps. There might be efficiency benefits, but that's not my main motivation. I do think it improves the QX in the general case: querying a single technology at a time, or querying all technologies.
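For instance, a quick sketch of that common single-technology case with the existing array column:

```sql
-- Sketch: the everyday single-technology lookup is a one-step UNNEST
-- with the ARRAY<STRUCT> column.
SELECT
  page
FROM
  `httparchive.all.pages`,
  UNNEST(technologies) AS t
WHERE
  date = '2022-10-01' AND
  client = 'mobile' AND
  t.technology = 'WordPress'
```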
Btw, another (admittedly small) grievance with arrays is that they make the preview more difficult to use, as you can't scroll down (e.g. I want a Lighthouse payload, but the first one doesn't have what I want, and scrolling down is painful due to arrays of "useless" data). I've noticed this type of thing more and more as I use this, and it's kinda annoying. Any easy workaround I'm missing?
I tend to use `AND rank = 1000` instead of preview because of this.
Great insights @tunetheweb, if that's actually not used then we'll be able to go down to:
- 43% (44 GB vs 104 GB) of the column size in `all.pages`, and
- 20% (1.1 TB vs 5.5 TB) in `all.requests`.
Here is an estimation query:
```sql
CREATE TEMP FUNCTION `prune_object`(
  json_str STRING,
  keys_to_remove ARRAY<STRING>
) RETURNS STRING
LANGUAGE js AS """
  try {
    var jsonObject = JSON.parse(json_str);
    keys_to_remove.forEach(function(key) {
      delete jsonObject[key];
    });
    return JSON.stringify(jsonObject);
  } catch (e) {
    return json_str;
  }
""";

SELECT
  SUM(BIT_COUNT(CAST(summary AS BYTES))) * 2 / 1024 / 1024 / 1024 AS summary_Gb,
  SUM(BIT_COUNT(CAST(summary_pruned AS BYTES))) * 2 / 1024 / 1024 / 1024 AS summary_pruned_Gb,
  SUM(BIT_COUNT(CAST(summary_pruned AS BYTES))) / SUM(BIT_COUNT(CAST(summary AS BYTES))) AS share
FROM (
  SELECT
    summary,
    prune_object(
      summary,
      ["metadata", "pageid", "createDate", "startedDateTime", "archive", "label", "crawlid", "url", "urlhash", "urlShort", "wptid", "wptrun", "rank", "PageSpeed", "_adult_site", "avg_dom_depth", "doctype", "document_height", "document_width", "localstorage_size", "sessionstorage_size", "meta_viewport", "num_iframes", "num_scripts", "num_scripts_sync", "num_scripts_async", "usertiming"]
    ) AS summary_pruned
  FROM `all.pages` TABLESAMPLE SYSTEM (5 PERCENT)
  WHERE date = '2024-08-01'
)
```

I've been looking into an idea of using JSON columns for `custom_metrics` and other payloads.
BigQuery promises cost, performance and QX improvements.
I like the idea of going from:

`CAST(JSON_VALUE(JSON_EXTRACT(custom_metrics, '$.wpt_bodies'), '$.raw_html.conditional_comment_count') AS INT64)` and `hasLazyHeuristics(JSON_EXTRACT(custom_metrics, '$.performance.lcp_elem_stats.attributes')) AS custom_lazy`

to:

`CAST(custom_metrics.wpt_bodies.raw_html.conditional_comment_count AS INT64)` and `hasLazyHeuristics(custom_metrics.performance.lcp_elem_stats.attributes) AS custom_lazy`.
We would still write strings to staging (there are limitations for writing JSON directly), but then transform it while appending to `all`.
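For concreteness, a hypothetical sketch of that append-time transform (the staging table name and column list here are made up):

```sql
-- Hypothetical sketch: staging keeps custom_metrics as a STRING; the append
-- step parses it into a native JSON column. Names are illustrative only.
INSERT INTO `httparchive.all.pages` (date, client, page, custom_metrics)
SELECT
  date,
  client,
  page,
  SAFE.PARSE_JSON(custom_metrics, wide_number_mode => 'round') AS custom_metrics
FROM
  `httparchive.crawl_staging.pages`  -- hypothetical staging table
WHERE
  date = '2024-09-01'
```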
It's a bit of a stretch, but worth checking now rather than after the backfill.
Have you discussed this before? Concerns to verify?
A few things I have noticed:
- The JSON data format works poorly with arrays, but great with object properties. See the example query below, where `technologies` and `custom_metrics` are JSON columns. You may notice how much easier it became to query `custom_metrics` properties, and how much more complicated it is for `technologies`:
```sql
SELECT
  page,
  metadata.rank,
  payload,
  custom_metrics.ads.ads.present,  -- nesting with dot notation
  categories
FROM
  `httparchive.scratchspace.pages_10k_JSON`,
  UNNEST(JSON_QUERY_ARRAY(technologies)) AS tech,  -- JSON is not supported for unnesting
  UNNEST(JSON_QUERY_ARRAY(tech, "$.categories")) AS categories
WHERE
  client = "mobile"
  AND is_root_page
  AND rank = 1000
```

I share the feelings about the UI issues related to REPEATED columns (issuetracker), but even more I like being able to use a simpler syntax, e.g.:
```sql
SELECT
  page
FROM
  `httparchive.all.pages`
WHERE
  date = '2024-09-01' AND
  client = 'mobile' AND
  is_root_page AND
  rank = 10000 AND
  'WordPress' IN UNNEST(technologies.technology) AND
  '6.6.2' IN UNNEST(technologies.info)
```

So let's hope for UI improvements.
- `wide_number_mode` makes a difference for JSON parsing. Usually the native JSON functions work faster, but parse less robustly than `JSON.parse` via a JS UDF. Adding `wide_number_mode => 'round'` seems to fix the parsing issues in most cases for me:
```sql
CREATE TEMP FUNCTION parse(cm STRING)
RETURNS JSON
LANGUAGE js AS """
  try {
    cm = JSON.parse(cm);
  } catch (e) {
    cm = null;
  }
  return cm;
""";

SELECT
  custom_metrics AS original,
  SAFE.PARSE_JSON(custom_metrics) AS native_parse_default,
  SAFE.PARSE_JSON(custom_metrics, wide_number_mode => 'round') AS native_parse_rounded,
  parse(custom_metrics) AS udf_parse
FROM
  `sample_data.pages_10k`
WHERE
  client = "mobile"
  AND is_root_page
  AND rank = 1000
  AND SAFE.PARSE_JSON(custom_metrics) IS NULL
```

