Consider clustering `all.requests` table by `page` or `rank`
tunetheweb opened this issue · 4 comments
One thing I find really handy for the all.pages table is setting rank = 1000 as a quick way to get results and save costs but still see real data (often the more interesting data too, to be honest!).
We can't do that with the all.requests table. We also can't quickly look up the data for a simple site so can't do this via the all.pages table either. It would be handy to be able to do either of these by clustering the all.requests table by page or rank .
Now there are a max of 4 clustering columns and we're already using 4 for all.requests:
clientis_root_pageis_main_documenttype
These are all useful so we'd need to drop one if we wanted to add a new column.
I think is_main_document is useful, but can mostly be repeated by type='html' AND is_main_document (not entirely but 99.8% of cases and the most useful ones!) so I'd prefer to replace that with either page or rank. I'm thinking page as can use that to get rank, but open to ideas.
Or maybe we should have wptid in requests table top allow joins on that instead of page?
Considering there is currently only 1 column to replace, I'd better go with a more granular page. Helpful for various page category analysis and debugging.
If needed, we could reproduce rank clustering by including only particular rank URLs from a CrUX list.
I think there is a unique wptid per each page value, no? In this case page is more readable alternative.
Here are queries for two sampled tables clustered by rank and page each:

There is no optimization happening with page-clustered table.
Here I added a summary column and used a temporary table for a more obvious and fairer bytes comparison.

Same result.
I believe the cluster column query limitation doesn't allow to use page flexibly.
And without this, rank is more useful for analysis.
@tunetheweb did you have another case in mind for page clustering?
Ah that's disappointing.
There are pther benefits to clustering on page for select pages (e.g. get me all the requests for https://www.example.com is currently quite expensive as requires a full table load), but given HTTP Archive is mostly about bulk analysis, and that this doesn't also allow rank anyway based on your experiments, I agree rank is more useful.