Investigate views on `all` dataset
tunetheweb opened this issue · 8 comments
@rviscomi , @paulcalvano , @pmeenan as discussed on last call.
I ran the following:
CREATE VIEW `httparchive.scratchspace.test_view_2022_08_01_mobile` AS (
SELECT
page,
rank,
payload
FROM
`httparchive.all.pages`
WHERE
date = '2022-08-01' AND
client = 'mobile'
);Then when I run this:
SELECT page FROM `httparchive.scratchspace.test_view_2022_08_01_mobile`It uses up 1.55 GB - so clearly isn't including the payload (in this view), and definitely not the Lighthouse results (not in this view, but in underlying table).
Changing it to this uses up 1.76GB (so slightly more as two columns):
SELECT page, rank FROM `httparchive.scratchspace.test_view_2022_08_01_mobile`Changing it to either of these uses up 5.38TB:
SELECT page, rank, payload FROM `httparchive.scratchspace.test_view_2022_08_01_mobile`
SELECT * FROM `httparchive.scratchspace.test_view_2022_08_01_mobile`This SQL is only 409.83 MB:
SELECT page, rank, payload FROM `httparchive.scratchspace.test_view_2022_08_01_mobile` WHERE rank = 1000So it looks to me like it IS using the clustering on the underlying table and not charging you the full amount each time, as you thought it would @rviscomi ? Unless I'm missing something?
This means we could in theory move everything to the new all schema (including backfilling) and replace all the existing old tables with views to maintain backwards compatibility for old data, but also have it in the new schema format, without duplicating data.
We could in theory also keep creating these old-style views every month so no one has to learn the new schema - they just get cheaper queries. That would also potentially reduce risk of someone accidentally querying all columns, or large date ranges, and running up huge bills too easily.
However I did discover one flaw in this plan in that you can't do this:
SELECT
_TABLE_SUFFIX AS client,
page,
rank,
payload
FROM
`httparchive.scratchspace.test_view_2022_08_01_*`
WHERE
rank = 1000As you get this error 😔:
Views cannot be queried through prefix. First view httparchive:scratchspace.test_view_2022_08_01_desktop.
Turns out this limitation is documented:
Wildcard tables support native BigQuery storage only. You cannot use wildcards when querying an external table or a view.
Still, non-wildcard queries would work on the views.
You could do this:
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_view_2022_08_01_star` AS (
SELECT
client AS TABLE_SUFFIX,
page,
rank,
payload
FROM
`httparchive.all.pages`
WHERE
date = '2022-08-01'
);But three downsides to this hack:
- The view is called
..._starinstead of...* - The table suffix column is called
TABLE_SUFFIXinstead of_TABLE_SUFFIX - Wouldn't scale to include alternative
*s
So not a drop in replacement.
Plus it's just an awful, awful, hack.
FYI, bugs for wildcard support in views:
I'm gonna close this. The Views look like a good option for latest dataset as being discussed in #141
Given the wildcard issue, they don't look like a good option to fully replicate our current schema, from the new schema. There's also a bigger question as to whether we want to maintain the old schema indefinitely even if we could do that with views.
I do have one further discussion point for potentially using views to restrict access to some of the more expensive queries which I've raised #149 for, but think that's a separate issue to this one, which was to investigate what views meant: did they indeed lead to cheaper queries (yes!) and are there any limits on them (yes - wildcards), so let's close this out.
Reopening. FYI @rviscomi
I'd be interested to use views for the July dataset and seeing what breaks
Quite a few queries in our BigQuery repo that fed our site for a start.
@tunetheweb you reopened it a few months ago, is there still some research to be done?
We have latest views and sample tables, both do their job limiting the data processed.
Most probably the origin of this discussion was addressed already, closing.
