Add monthly CWV Tech Report materialization query
rviscomi opened this issue · 7 comments
The query to materialize the technologies table with the latest CWV Tech Report data should be run before the new pipeline, as those queries read from the table.
I've added a scheduled query with the hard-coded default date replaced with @run_date:
- product documentation
- Scheduled query: Core Web Vitals Technologies Report
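For illustration, the substitution described above amounts to something like the following. The actual table and column names aren't shown in this thread, so treat this fragment as a hypothetical sketch:

```sql
-- Before (hard-coded default date):
--   WHERE date = '2024-02-01'
-- Scheduled version, parameterized on the schedule's run date:
WHERE date = @run_date
```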
Note: there is a bug with scheduled queries using @run_date with partitioned tables and required filters (issue tracker), so the scheduled query includes a hacky workaround not present in the original query.
This scheduled query is currently set to run on demand so that it can be triggered manually from another source (e.g. an API call) rather than on the first of the month, since the execution date may vary from the run date.
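For reference, one way to trigger an on-demand scheduled query manually is a Data Transfer Service manual run via the bq CLI. This is a sketch only: the transfer config ID below is a placeholder, and the eventual trigger mechanism for this report is not decided in this thread.

```
# List scheduled queries to find the transfer config ID (placeholder used below).
bq ls --transfer_config --transfer_location=us

# Start a manual run for a specific logical run date.
bq mk --transfer_run \
  --run_time='2024-02-01T00:00:00Z' \
  projects/httparchive/locations/us/transferConfigs/TRANSFER_CONFIG_ID
```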
Testing as of 2024-03-10 after the most recent crawl completed for 2024-03-01 and using a run date of 2024-02-01. This produced no results since the new pipeline has not finished yet and upstream tables have not been updated in the all and summary_* datasets. This query may need adjustment if it needs to be run before the new pipeline.
Testing as of 2024-03-10 after the most recent crawl completed for 2024-03-01 and using a run date of 2024-02-01. This produced no results since the new pipeline has not finished yet and upstream tables have not been updated in the all and summary_* datasets. This query may need adjustment if it needs to be run before the new pipeline.
The crawl starts after the CrUX data is published on the second Tuesday of the month. So the all data is available for February, but March will only start this week. We also don't need the summary_* datasets for this report as it uses the all datasets (which is good as we're having trouble with that dataset as discussed!) The CrUX data is for the full month so won't show until April 9th (second Tuesday of the next month) and it's joined to the crawl data for this report.
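The "second Tuesday" dates above can be computed directly; here is a small sketch (GNU date assumed) confirming the April 9th publication date mentioned for the next month:

```shell
# Compute the second Tuesday of a month (GNU date assumed; %u = ISO weekday, 2 = Tuesday).
year=2024; month=04
d="$year-$month-01"
count=0
while true; do
  if [ "$(date -d "$d" +%u)" = "2" ]; then
    count=$((count + 1))
    [ "$count" -eq 2 ] && break
  fi
  d=$(date -d "$d + 1 day" +%F)
done
echo "$d"   # 2024-04-09: the second Tuesday, when the March CrUX data is published
```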
I also went to run this manually this month (for February), but it says it'll process 337 TB, whereas the manual one says 18 TB, so I went with that for this month since it's quite a difference! It could be overestimating how much it'll process, but 337 TB is a LOT, so I didn't want to take the chance. Thoughts?
The crawl starts after the CrUX data is published on the second Tuesday of the month.
Ok, perfect! The data isn't "delayed"; instead, I ran the query with dates too far into the future at that point in time.
We also don't need the summary_* datasets for this report as it uses the all datasets (which is good as we're having trouble with that dataset as discussed!)
That's good. The query mentioned in the issue description is written to use tables in the all dataset only. I mentioned the summary_* datasets as well just in case there may be a reason to use them as an alternative.
I also went to run this manually this month (for February), but it says it'll process 337 TB, whereas the manual one says 18 TB, so I went with that for this month since it's quite a difference! It could be overestimating how much it'll process, but 337 TB is a LOT, so I didn't want to take the chance. Thoughts?
In my experience, the Google Cloud console's query validator is not very accurate, especially when working with scripts, partitioned tables, and filters. I ran this scheduled query during testing and it processed 17.27 TB (rather than the estimated 337 TB).
giancarlo_faranda@cloudshell:~ (httparchive)$ bq show --location=US --job httparchive:scheduled_query_65fd9c6e-0000-255a-8802-582429c39d0c
Job httparchive:scheduled_query_65fd9c6e-0000-255a-8802-582429c39d0c

  Job Type   State     Start Time        Duration         User Email                    Bytes Processed   Bytes Billed     Billing Tier   Labels
 ---------- --------- ----------------- ---------------- ----------------------------- ----------------- ---------------- -------------- -------------------------------------------------
  query      SUCCESS   10 Mar 22:41:23   0:07:59.888000   Giancarlo.Faranda@gmail.com   18990487562745    18990488354816                  dts_run_id:65fd9c6e-0000-255a-8802-582429c39d0c,
                                                                                                                                         data_source_id:scheduled_query

OK cool. Will give it a go next month then. I presume it is NOT automated yet to kick off? Especially since I've already run it manually for this month.
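As a sanity check on the numbers above: BigQuery reports data volumes in binary terabytes (2^40 bytes), so the Bytes Processed figure in the job output matches the 17.27 TB mentioned earlier:

```shell
# Convert the job's Bytes Processed (18,990,487,562,745) to binary TB (TiB = 2^40 bytes).
awk 'BEGIN { printf "%.2f TiB\n", 18990487562745 / 2^40 }'
# prints: 17.27 TiB
```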
@tunetheweb Correct, it is not automated yet, but I think we can trigger this from the same Pub/Sub subscription that initiates the crawl.
Is this the topic which signals the start of the crawl? projects/httparchive/topics/crux-updated (CC @pmeenan @rviscomi )
Closed in HTTPArchive/dataform#3
