Investigate views on `all` dataset

Question

Investigate views on `all` dataset

tunetheweb opened this issue 3 years ago · 8 comments

@rviscomi , @paulcalvano , @pmeenan as discussed on last call.

I ran the following:

CREATE VIEW `httparchive.scratchspace.test_view_2022_08_01_mobile` AS (
  SELECT
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2022-08-01' AND
    client = 'mobile'
);

Then when I run this:

SELECT page FROM `httparchive.scratchspace.test_view_2022_08_01_mobile`

It uses up 1.55 GB - so clearly isn't including the payload (in this view), and definitely not the Lighthouse results (not in this view, but in underlying table).

Changing it to this uses up 1.76GB (so slightly more as two columns):

SELECT page, rank FROM `httparchive.scratchspace.test_view_2022_08_01_mobile`

Changing it to either of these uses up 5.38TB:

SELECT page, rank, payload FROM `httparchive.scratchspace.test_view_2022_08_01_mobile`
SELECT * FROM `httparchive.scratchspace.test_view_2022_08_01_mobile`

This SQL is only 409.83 MB:

SELECT page, rank, payload FROM `httparchive.scratchspace.test_view_2022_08_01_mobile` WHERE rank = 1000

So it looks to me like it IS using the clustering on the underlying table and not charging you the full amount each time, as you thought it would @rviscomi ? Unless I'm missing something?

This means we could in theory move everything to the new all schema (including backfilling) and replace all the existing old tables with views to maintain backwards compatibility for old data, but also have it in the new schema format, without duplicating data.

We could in theory also keep creating these old-style views every month so no one has to learn the new schema - they just get cheaper queries. That would also potentially reduce risk of someone accidentally querying all columns, or large date ranges, and running up huge bills too easily.

However I did discover one flaw in this plan in that you can't do this:

SELECT
  _TABLE_SUFFIX AS client,
  page,
  rank,
  payload
FROM
  `httparchive.scratchspace.test_view_2022_08_01_*`
WHERE
  rank = 1000

As you get this error 😔:

Views cannot be queried through prefix. First view httparchive:scratchspace.test_view_2022_08_01_desktop.

Turns out this limitation is documented:

Wildcard tables support native BigQuery storage only. You cannot use wildcards when querying an external table or a view.

Still, non-wildcard queries would work on the views.

Answer 1 · 2022-09-21T21:03:37.000Z

You could do this:

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_view_2022_08_01_star` AS (
  SELECT
    client AS TABLE_SUFFIX,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2022-08-01'
);

But three downsides to this hack:

The view is called ..._star instead of ...*
The table suffix column is called TABLE_SUFFIX instead of _TABLE_SUFFIX
Wouldn't scale to include alternative *s

So not a drop in replacement.

Plus it's just an awful, awful, hack.

Answer 2 · 2022-09-22T11:55:58.000Z

FYI, bugs for wildcard support in views:

Answer 3 · 2022-10-21T20:16:35.000Z

I'm gonna close this. The Views look like a good option for latest dataset as being discussed in #141

Given the wildcard issue, they don't look like a good option to fully replicate our current schema, from the new schema. There's also a bigger question as to whether we want to maintain the old schema indefinitely even if we could do that with views.

I do have one further discussion point for potentially using views to restrict access to some of the more expensive queries which I've raised #149 for, but think that's a separate issue to this one, which was to investigate what views meant: did they indeed lead to cheaper queries (yes!) and are there any limits on them (yes - wildcards), so let's close this out.

Answer 4 · 2024-06-20T19:49:15.000Z

Reopening. FYI @rviscomi

Answer 5 · 2024-06-20T19:58:19.000Z

I'd be interested to use views for the July dataset and seeing what breaks

Answer 6 · 2024-06-20T21:27:35.000Z

Quite a few queries in our BigQuery repo that fed our site for a start.

Answer 7 · 2024-10-18T22:35:20.000Z

@tunetheweb you reopened it a few months ago, is there still some research to be done?

Answer 8 · 2024-12-18T17:27:31.000Z

We have latest views and sample tables, both do their job limiting the data processed.
Most probably the origin of this discussion was addressed already, closing.