Convert the `latest` dataset to views
rviscomi opened this issue · 16 comments
We have scheduled queries that generate the tables in the httparchive.latest dataset. These tables currently have no content, due to an unknown bug. Can we leverage views and the new partitioned all dataset to make this process more streamlined and maintenance-free?
As investigated in #142 this should be possible, with the exception that they cannot be used in wildcard queries.
My main concerns would be:
- The wildcard issue
- The
alldatastream is streaming, so would need to update the views after that's finished to avoid people querying half datasets.
One possibility is to create a httparchive.all.latest_date table with one date (e.g. 2022-09-01) and then create the view using something like:
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop` AS (
SELECT
date,
client
....
FROM
`httparchive.all.pages`
JOIN
`httparchive.all.latest_date`
USING (date)
WHERE
client = 'desktop'
);And then could just update httparchive.all.latest_date at the end of each month's run to signal that data is now ready, and all the views referencing that would automatically switch to the new date.
I ran this, and it seemed to work:
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_date` AS (
SELECT CAST('2022-08-01' AS DATE) AS date
)
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop` AS (
SELECT
date,
client,
page,
rank,
payload
FROM
`httparchive.all.pages`
JOIN
`httparchive.scratchspace.test_latest_date`
USING (date)
WHERE
client = 'desktop'
);and I get these costs:
SELECT page FROM `httparchive.scratchspace.test_latest_desktop`; -- 1.13GB
SELECT page, rank FROM `httparchive.scratchspace.test_latest_desktop`; -- 1.27GB
SELECT page, rank, payload FROM `httparchive.scratchspace.test_latest_desktop`; -- 4.79 TB
SELECT page, rank, payload FROM `httparchive.scratchspace.test_latest_desktop` WHERE rank = 1000; -- 532.58MBMy main concerns would be:
- The wildcard issue
- The
alldatastream is streaming, so would need to update the views after that's finished to avoid people querying half datasets.
Could you clarify if the wildcard issue applies to the all dataset? Unlike the dated tables like pages.2022_09_01 that need wildcards like pages.* to process multiple dates, the all.pages table is partitioned by date.
Also, we've recently switched from streaming to batch inserts due to maintenance and data quality complexities, so partial datasets are no longer a concern.
Could you clarify if the wildcard issue applies to the
alldataset? Unlike the dated tables likepages.2022_09_01that need wildcards likepages.*to process multiple dates, theall.pagestable is partitioned by date.
No, the wildcard issue only applies if And httparchive.latest.summary_pages_desktop and httparchive.latest.summary_pages_mobile are views on the all dataset and you want to run this:
SELECT
_TABLE_SUFFIX AS client,
col1
FROM
`httparchive.latest.summary_pages_*`That will not work, as it's using a wildcard on two views.
But if you did this to query one table, it would work fine:
SELECT
col1
FROM
``httparchive.latest.summary_pages_desktopAlso, we've recently switched from streaming to batch inserts due to maintenance and data quality complexities, so partial datasets are no longer a concern.
Oh yeah keep forgetting this!
We'd still need to redefine the latest views each month as part of the batch after the data is loaded to look at the latest date.
I tried doing this, but it didn't work:
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop2` AS (
SELECT
date,
client,
page,
rank,
payload
FROM
`httparchive.all.pages`
WHERE
client = 'desktop' AND
date = (SELECT max(date) from `httparchive.all.pages`)
);
SELECT client FROM `httparchive.scratchspace.test_latest_desktop2`;The last SELECT complains of a missing required date param.
Where as hard coding the date (AND date = 2022-08-01') in the view definition, or joining to a latest_date` table, allows the last SELECT to run fine.
The all tables are clustered by client, so is there any reason not to have a single view for each legacy table type, for example latest.summary_pages? The user could filter to a client with WHERE client = 'desktop' if needed. Based on your other experiment, it seems like it should have the same performance gains as applying the filter on the clustered table itself.
Alternatively, users can UNION ALL both desktop/mobile views together if needed.
The last
SELECTcomplains of a missing required date param.
Try something like
SELECT max(date) from `httparchive.all.pages` WHERE date > '2000-01-01'But I think there's a better way using INFORMATION_SCHEMA.PARTITIONS to query the metadata directly:
SELECT
MAX(partition_id)
FROM
`httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'pages' AND
partition_id != '__NULL__'Alternatively, users can UNION ALL both desktop/mobile views together if needed.
Don't think that would work as need _TABLE_SUFFIX to get the client. Could add client column to latest views, but seems a little redundant and still would require the changes (change to UNION ALL and use client column). Not the worst, but a change from what's there currently...
Try something like
SELECT max(date) from `httparchive.all.pages` WHERE date > '2000-01-01'
Same error:
Query error: Cannot query over table 'httparchive.all.pages' without a filter over column(s) 'date' that can be used for partition elimination at [16:1]
But I think there's a better way using INFORMATION_SCHEMA.PARTITIONS to query the metadata directly:
I'm not sure how to use this info to create the view though?
This validates:
SELECT
date,
client,
page,
rank,
payload
FROM
`httparchive.all.pages`
WHERE
date IS NOT NULL AND
date = (
SELECT
CAST(MAX(partition_id) AS DATE) AS date
FROM
`httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'pages' AND
partition_id != '__NULL__')
AND client = 'desktop'date IS NOT NULL seems to satisfy the validator
Not tested
Close! This works:
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop2` AS (
SELECT
date,
client,
page,
rank,
payload
FROM
`httparchive.all.pages`
WHERE
date IS NOT NULL AND
date = (
SELECT
CAST(REGEXP_REPLACE(MAX(partition_id), r'(\d{4})(\d{2})(\d{2})', '\\1-\\2-\\3') AS DATE) AS date
FROM
`httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'pages' AND
partition_id != '__NULL__')
AND client = 'desktop'
);
select client from `httparchive.scratchspace.test_latest_desktop2`;Only 322MB processed!
REGEXP_REPLACE(MAX(partition_id), r'(\d{4})(\d{2})(\d{2})', '\\1-\\2-\\3')
Nit/tip: use r'' to avoid escaping the capture groups r'\1-\2-\3'
And adding rank is even quicker:
So it looks like clustering does affect performance of the views, which is great. In that case I don't see a reason to continue distinguishing between desktop/mobile in the latest tables/views.
In that case I don't see a reason to continue distinguishing between desktop/mobile in the
latesttables/views.
Well that removes the wildcard issue!
Though it is a breaking change. But does anyone even use the latest tables? I never do as prefer to be explicit. Guess we’ll find out when we make this change…
Yeah I filed this issue in response to a DM from a Googler trying to use one of the latest tables. We could have some sort of deprecation period when we support old and new versions of the views, and announce the timeline on the forum/social.
Hi Rick and Barry, is there an update on enabling the .latest table view?
Thank you.
@romaincurutchet not yet sorry. Which views are you querying? We can create a new experimental view to unblock you and verify that the proposed approach works.
I have created three new views in the meantime:
httparchive.latest.pageshttparchive.latest.requestshttparchive.latest.lighthouse
These will automatically point to the latest month of data.
These tables are slightly different to the current tables (as well as not being split by desktop and mobile - use the client column to restrict those) but hopefully should be easy to convert your queries to these new ones. Note there are some extra columns in these, which might make them easier (and cheaper and quicker!) to query if you use them.
We're still figuring our the final schema, so this is subject to change, but hopefully that unblocks you for now @romaincurutchet
Thank you both!
I just checked that the views give pretty good estimation for queries with cluster filters.
So the current solution is the best. Closing.

