HTTPArchive/custom-metrics

Investigate impact of first response body not being HTML

rviscomi opened this issue · 3 comments

Some custom metrics rely on $WPT_BODIES[0] being the main HTML document. However, we've seen some edge cases (on up to 15% of pages) where the first request does not correspond to the main document. These custom metrics would presumably be processing the wrong response body.

Investigate whether this is actually happening and, if so, how to fix it.

It doesn't seem like the $WPT_BODIES object is affected by this issue.

Here is a test from the 2022_04_01_desktop crawl in which the URL was incorrectly parsed as http://cacerts.digitalcertvalidation.com/TrustAsiaTLSRSACA.crt. The request at entries[0] in the HAR corresponds to the certificate and _is_base_page is set to true.

{
    "_full_url": "http://cacerts.digitalcertvalidation.com/TrustAsiaTLSRSACA.crt",
    "_is_base_page": true,
    "_index": 0
}
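
For anyone who wants to spot this kind of mismatch in other HARs, here is a rough sketch of a check. It relies only on the `_full_url` and `_is_base_page` fields shown above and the standard `log.entries` HAR layout. The function name `baseEntryMatchesPage` is made up for illustration, the sample pairs the cert entry with `https://52.mk/` on the assumption that this is the page under test, and the hostname comparison is a heuristic (a legitimate redirect to another host would also be flagged).

```javascript
// Hypothetical helper, not part of any HTTP Archive or WebPageTest tooling.
// Returns true when the HAR entry flagged as the base page is on the same
// host as the page that was tested.
function baseEntryMatchesPage(har, testedUrl) {
  const entries = (har && har.log && har.log.entries) || [];
  const base = entries.find((entry) => entry._is_base_page === true);
  if (!base) {
    return false; // no entry is flagged as the base page at all
  }
  // Heuristic: compare hostnames only, ignoring scheme and path.
  return new URL(base._full_url).hostname === new URL(testedUrl).hostname;
}

// Sample shaped like the entry above; the tested URL is an assumption.
const sampleHar = {
  log: {
    entries: [
      {
        _full_url: 'http://cacerts.digitalcertvalidation.com/TrustAsiaTLSRSACA.crt',
        _is_base_page: true,
        _index: 0,
      },
    ],
  },
};

console.log(baseEntryMatchesPage(sampleHar, 'https://52.mk/'));
```

A `false` result here only says the flagged entry is on a different host than the tested page, which is the symptom described above.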

I reran the test with a custom metric that outputs the full $WPT_BODIES object. The object schema is not quite the same as the HAR, but it's clear that the first item is the HTML document itself, not the cert:

{
    "url": "https://52.mk/",
    "type": "Document"
}

So I think we're ok.

The waterfall and main request data are post-processed using the netlog trace events. $WPT_BODIES (and $WPT_REQUESTS) use the DevTools request details, which don't see things like OCSP checks, so the first request should (hopefully) always be the actual navigation.
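
If a custom metric wants to avoid depending on index 0 at all, one option is to select the main document by its `type` rather than by position. This is only a sketch: `findMainDocument` is a hypothetical helper, the `url` and `type` fields are the ones shown in the output above, and `response_body` and the second sample entry are assumptions made for illustration. Iframe documents may also be typed `Document`, so the fallback to the first one is a heuristic.

```javascript
// Hypothetical helper: pick the main document from a $WPT_BODIES-like array
// instead of assuming it always sits at index 0.
function findMainDocument(bodies, pageUrl) {
  if (!Array.isArray(bodies) || bodies.length === 0) {
    return null;
  }
  // Only entries typed "Document" are candidates.
  const documents = bodies.filter((entry) => entry && entry.type === 'Document');
  // Prefer the document whose URL matches the tested page, when it is known.
  const exact = pageUrl
    ? documents.find((entry) => entry.url === pageUrl)
    : undefined;
  // Otherwise fall back to the first document in request order.
  return exact || documents[0] || null;
}

// Made-up sample data; `response_body` is an assumed field name.
const sampleBodies = [
  { url: 'https://52.mk/', type: 'Document', response_body: '<html></html>' },
  { url: 'https://52.mk/app.js', type: 'Script', response_body: '' },
];

const mainDocument = findMainDocument(sampleBodies, 'https://52.mk/');
console.log(mainDocument ? mainDocument.url : 'no document found');
```

In a real custom metric the array would come from $WPT_BODIES; the helper returns `null` when no document is present so the caller can bail out instead of parsing the wrong body.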

Great, thanks for confirming.