HTTPArchive/data-pipeline

Incomplete HTML response bodies

rviscomi opened this issue · 2 comments

Noticed an unusually high number of pages without a <body> tag in the HTML response body. Here's a query to take a sample:

WITH req AS (
  SELECT
    page,
    response_body
  FROM
    `httparchive.all.requests` TABLESAMPLE SYSTEM (0.10 PERCENT)
  WHERE
    date = '2023-05-01' AND
    client = 'mobile' AND
    is_main_document AND
    is_root_page AND
    REGEXP_EXTRACT(response_body, r'(?i)(.*)<body') IS NULL
  LIMIT 10
),

pages AS (
  SELECT
    page,
    wptid
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-05-01' AND
    client = 'mobile' AND
    is_root_page
)


SELECT
  page,
  wptid,
  response_body
FROM
  req
JOIN
  pages
USING
  (page)

For example, here's one WPT response body, and you can see it's cut off mid-CSS: https://webpagetest.httparchive.org/response_body.php?test=230509_Mx1ZG_FBEG4&run=1&bodyid=741FA7D767C970D82CCE5621C0B68519

.et_animated.slideLeft{-webkit-animation-name:et_pb_slideLeft;animation-name:et_pb_slideLeft}@-webkit-keyframes et_pb_bounce{0%,20%,40%,60%,80%,to{-webkit-animation-timing-function:cubic-bezier(.215,.61,.355

The page seems to render fine in the test, as the filmstrip shows visual content and the waterfall is full of requests. Viewing source on the live page also shows complete HTML.

So maybe there's something in WPT or the HA pipeline that's cutting off the response body?
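
To tell a cut-off body apart from a page that legitimately omits the tag, a sampled `response_body` could be checked locally with something like the sketch below. This is only a rough heuristic, not anything WPT or the HA pipeline does: the function name `classify_body` and the labels it returns are made up for illustration, and since HTML allows both `<body>` and `</html>` to be omitted, the result is a hint rather than proof of truncation.

```python
import re

# Hypothetical helper for illustration; the markers below are heuristics only.
BODY_OPEN = re.compile(r"<body\b", re.IGNORECASE)
HTML_CLOSE = re.compile(r"</html\s*>", re.IGNORECASE)


def classify_body(response_body):
    """Return a rough label describing how complete an HTML body looks."""
    if not response_body:
        return "empty"
    has_body = BODY_OPEN.search(response_body) is not None
    has_close = HTML_CLOSE.search(response_body) is not None
    if has_body and has_close:
        return "complete"
    if not has_body and not has_close:
        # Neither marker is present: consistent with a cut-off inside <head>.
        return "likely truncated before <body>"
    if has_body:
        # <body> was opened but the document never closes.
        return "likely truncated after <body>"
    # </html> is present but <body> is not: valid HTML may omit the tag.
    return "no <body> tag"


if __name__ == "__main__":
    # A fragment shaped like the sample in this issue: it stops mid-CSS.
    sample = "<html><head><style>.et_animated.slideLeft{animation-name:et_pb_slideLeft"
    print(classify_body(sample))
```

Bodies labelled "no <body> tag" are probably not worth chasing, since they end cleanly; the two "likely truncated" labels are the ones that would point at WPT or the pipeline.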

I'll look closer during the week in case there's something going on with the netlog body streaming. In the meantime, just in case disk space is the issue, I doubled the size of the agent SSDs from 10GB to 20GB (and got approval for the quota).

The test agent we use for manual tests had run out of disk space and was failing, so it is now running with a lot more space than the regular agents. That agent is persistent for months, though, while the regular agents are only alive for a day or so, and I'd expect much worse side effects than some truncated bodies if they were filling their disks. Still, it doesn't hurt to have a bit more breathing room on the disks.

Besides the disk fix, I suppose the data was lost, so there is nothing else to do.
Closing.