Add monthly CWV Tech Report materialization query
rviscomi opened this issue · 7 comments
The query to materialize the technologies table with the latest CWV Tech Report data should be run before the new pipeline, as those queries read from the table.
I've added a scheduled query with the hard-coded default date replaced with @run_date:
- product documentation
- Scheduled query: Core Web Vitals Technologies Report
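For illustration, the substitution described above amounts to something like the following. The actual table and column names aren't shown in this thread, so treat this fragment as a hypothetical sketch:

```sql
-- Before (hard-coded default date):
--   WHERE date = '2024-02-01'
-- Scheduled version, parameterized on the schedule's run date:
WHERE date = @run_date
```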
Note: there is a bug with scheduled queries using @run_date with partitioned tables and required filters (issue tracker), so the scheduled query includes a hacky workaround not present in the original query.
This scheduled query is currently set to run on demand so that it can be triggered manually from another source (e.g. an API call) rather than on the first of the month, since the execution date may vary from the run date.
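For reference, one way to trigger an on-demand scheduled query manually is a Data Transfer Service manual run via the bq CLI. This is a sketch only: the transfer config ID below is a placeholder, and the eventual trigger mechanism for this report is not decided in this thread.

```
# List scheduled queries to find the transfer config ID (placeholder used below).
bq ls --transfer_config --transfer_location=us

# Start a manual run for a specific logical run date.
bq mk --transfer_run \
  --run_time='2024-02-01T00:00:00Z' \
  projects/httparchive/locations/us/transferConfigs/TRANSFER_CONFIG_ID
```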
Testing as of 2024-03-10 after the most recent crawl completed for 2024-03-01 and using a run date of 2024-02-01. This produced no results since the new pipeline has not finished yet and upstream tables have not been updated in the all and summary_* datasets. This query may need adjustment if it needs to be run before the new pipeline.
Testing as of 2024-03-10 after the most recent crawl completed for 2024-03-01 and using a run date of 2024-02-01. This produced no results since the new pipeline has not finished yet and upstream tables have not been updated in the all and summary_* datasets. This query may need adjustment if it needs to be run before the new pipeline.
The crawl starts after the CrUX data is published on the second Tuesday of the month. So the all data is available for February, but March will only start this week. We also don't need the summary_* datasets for this report as it uses the all datasets (which is good as we're having trouble with that dataset as discussed!) The CrUX data is for the full month so won't show until April 9th (second Tuesday of the next month) and it's joined to the crawl data for this report.
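The "second Tuesday" dates above can be computed directly; here is a small sketch (GNU date assumed) confirming the April 9th publication date mentioned for the next month:

```shell
# Compute the second Tuesday of a month (GNU date assumed; %u = ISO weekday, 2 = Tuesday).
year=2024; month=04
d="$year-$month-01"
count=0
while true; do
  if [ "$(date -d "$d" +%u)" = "2" ]; then
    count=$((count + 1))
    [ "$count" -eq 2 ] && break
  fi
  d=$(date -d "$d + 1 day" +%F)
done
echo "$d"   # 2024-04-09: the second Tuesday, when the March CrUX data is published
```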
I also went to run this manually this month (for February), but it says it'll process 337 TB, whereas the manual one says 18 TB, so I went with that for this month since it's quite a difference! It could be overestimating how much it'll process, but 337 TB is a LOT, so I didn't want to take the chance. Thoughts?
The crawl starts after the CrUX data is published on the second Tuesday of the month.
Ok, perfect! The data isn't "delayed"; instead, I ran the query with dates too far into the future at that point in time.
We also don't need the summary_* datasets for this report as it uses the all datasets (which is good as we're having trouble with that dataset as discussed!)
That's good. The query mentioned in the issue description is written to use tables in the all dataset only. I mentioned the summary_* datasets as well just in case there may be a reason to use them as an alternative.
I also went to run this manually this month (for February), but it says it'll process 337 TB, whereas the manual one says 18 TB, so I went with that for this month since it's quite a difference! It could be overestimating how much it'll process, but 337 TB is a LOT, so I didn't want to take the chance. Thoughts?
In my experience, the Google Cloud console's query validator is not very accurate, especially when working with scripts, partitioned tables, and filters. I ran this scheduled query during testing and it processed 17.27 TB (rather than the estimated 337 TB).
giancarlo_faranda@cloudshell:~ (httparchive)$ bq show --location=US --job httparchive:scheduled_query_65fd9c6e-0000-255a-8802-582429c39d0c
Job httparchive:scheduled_query_65fd9c6e-0000-255a-8802-582429c39d0c

  Job Type   State     Start Time        Duration         User Email                    Bytes Processed   Bytes Billed     Billing Tier   Labels
 ---------- --------- ----------------- ---------------- ----------------------------- ----------------- ---------------- -------------- -------------------------------------------------
  query      SUCCESS   10 Mar 22:41:23   0:07:59.888000   Giancarlo.Faranda@gmail.com   18990487562745    18990488354816                  dts_run_id:65fd9c6e-0000-255a-8802-582429c39d0c,
                                                                                                                                         data_source_id:scheduled_query

OK cool. Will give it a go next month then. I presume it is NOT automated yet to kick off? Especially since I've already run it manually for this month.
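As a sanity check on the numbers above: BigQuery reports data volumes in binary terabytes (2^40 bytes), so the Bytes Processed figure in the job output matches the 17.27 TB mentioned earlier:

```shell
# Convert the job's Bytes Processed (18,990,487,562,745) to binary TB (TiB = 2^40 bytes).
awk 'BEGIN { printf "%.2f TiB\n", 18990487562745 / 2^40 }'
# prints: 17.27 TiB
```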
@tunetheweb Correct, it is not automated yet, but I think we can trigger this from the same Pub/Sub subscription that initiates the crawl.
Is this the topic which signals the start of the crawl? projects/httparchive/topics/crux-updated (CC @pmeenan @rviscomi )
Closed in HTTPArchive/dataform#3
