HTTPArchive/custom-metrics

Implement CMS-focused custom metric for whether a site uses a WordPress block theme

felixarntz opened this issue · 1 comment

Per the HTTP Archive discussion on Slack, we want to find out how common the so-called "block themes" are across WordPress sites. A site can be identified as using a block theme if its markup includes a <div class="wp-site-blocks"> element. Additionally, only sites running WordPress 5.9 or higher are technically able to use block themes.

I originally wrote a query for this in GoogleChromeLabs/wpp-research#32 (consuming 65TB 🤯). @rviscomi shared with me a more efficient version of that query, which I've pasted below. However, checking for the element with DOM APIs would be more straightforward than matching the response body with a regular expression, so a custom metric for this would be valuable.
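
As a rough illustration of the DOM-based check, here is a minimal sketch. The function names `usesBlockTheme` and `blockThemeMetric` and the `block_theme` key are hypothetical, and this is not necessarily how the metric ends up being implemented; the sketch only assumes it is handed a standard `Document`.

```javascript
// Hypothetical sketch: detect a WordPress block theme from the rendered DOM.
// `doc` is expected to be a standard Document (e.g. `document` in the page).
function usesBlockTheme(doc) {
  // Per this issue, a block theme page includes <div class="wp-site-blocks">.
  // A class selector also matches when the element carries extra classes or
  // attributes, which an exact-string regular expression would miss.
  return doc.querySelector('div.wp-site-blocks') !== null;
}

// Hypothetical result shape; the key name is an assumption.
function blockThemeMetric(doc) {
  return { block_theme: usesBlockTheme(doc) };
}
```

In a page context this would be called as `blockThemeMetric(document)`. The version constraint (WordPress 5.9 or higher) is not checked here.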

For reference, here's the alternative query created by @rviscomi:

WITH wordpress AS (
  SELECT DISTINCT
    client,
    page
  FROM
    `httparchive.all.pages`,
    UNNEST(technologies) AS t,
    t.info AS version
  WHERE
    date = '2022-10-01' AND
    is_root_page AND
    t.technology = 'WordPress' AND
    -- Keep sites with an undetected version or with major.minor >= 5.9.
    (version = '' OR CAST(REGEXP_EXTRACT(version, r'^(\d+\.\d+)') AS FLOAT64) >= 5.9)
),

block_themes AS (
  SELECT
    client,
    page
  FROM
    `httparchive.all.requests`
  WHERE
    date = '2022-10-01' AND
    is_root_page AND
    is_main_document AND
    -- Exact-string match on the block theme wrapper element.
    REGEXP_CONTAINS(response_body, r'<div class="wp-site-blocks">')
)


SELECT
  client,
  COUNT(0) AS pages
FROM
  wordpress
JOIN
  block_themes
USING
  (client, page)
GROUP BY
  client
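
One caveat about the version filter in the query above: casting major.minor to FLOAT64 compares versions as decimals, so a hypothetical version 5.10 would be read as 5.1 and fall below 5.9. WordPress went from 5.9 straight to 6.0, so this does not affect the result here, but a component-wise comparison avoids the problem in general. A minimal sketch in JavaScript (the language custom metrics are written in), using a hypothetical helper name:

```javascript
// Hypothetical helper: compare a "major.minor[.patch]" version string against
// a minimum major/minor, component by component rather than as a decimal.
function isAtLeastVersion(version, minMajor, minMinor) {
  const match = /^(\d+)\.(\d+)/.exec(version);
  if (!match) return false; // no parseable major.minor prefix
  const major = Number(match[1]);
  const minor = Number(match[2]);
  return major > minMajor || (major === minMajor && minor >= minMinor);
}
```

Unlike the query, this returns false for an empty version string; a caller wanting the query's behavior (keeping sites whose version is unknown) would need to handle that case separately.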

Fixed via #62.