Investigate impact of first response body not being HTML
rviscomi opened this issue · 3 comments
Some custom metrics rely on $WPT_BODIES[0] being the main HTML document. However, we've seen some edge cases (on up to 15% of pages) where the first request does not correspond to the main document. These custom metrics would presumably be processing the data for the wrong response body.
Investigate whether this is actually happening and how to fix it, if so.
It doesn't seem like the $WPT_BODIES object is affected by this issue.
Here is a test from the 2022_04_01_desktop crawl in which the URL was incorrectly parsed as http://cacerts.digitalcertvalidation.com/TrustAsiaTLSRSACA.crt. The request at entries[0] in the HAR corresponds to the certificate and _is_base_page is set to true.
{
"_full_url": "http://cacerts.digitalcertvalidation.com/TrustAsiaTLSRSACA.crt",
"_is_base_page": true,
"_index": 0
}
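To look for other affected tests, a rough offline check over the HAR entries could look like the sketch below. It only uses the three fields shown above (_full_url, _is_base_page, _index). The function names are hypothetical, and it assumes https://52.mk/ was the URL under test. Comparing hosts is a heuristic: a legitimate cross-host redirect would be flagged too, so hits are candidates to inspect rather than confirmed errors.

```python
from urllib.parse import urlsplit


def find_base_page_entry(entries):
    """Return the first HAR entry flagged with _is_base_page, or None."""
    for entry in entries:
        if entry.get("_is_base_page"):
            return entry
    return None


def base_page_host_mismatch(entries, tested_url):
    """Return True when the flagged base page is on a different host than tested_url.

    Assumption: a base page on another host is suspicious. A missing base
    page is also reported as a mismatch.
    """
    base = find_base_page_entry(entries)
    if base is None:
        return True
    base_host = urlsplit(base["_full_url"]).hostname
    tested_host = urlsplit(tested_url).hostname
    return base_host != tested_host


# Entry copied from the HAR excerpt above; the tested URL is an assumption.
entries = [
    {
        "_full_url": "http://cacerts.digitalcertvalidation.com/TrustAsiaTLSRSACA.crt",
        "_is_base_page": True,
        "_index": 0,
    }
]
print(base_page_host_mismatch(entries, "https://52.mk/"))
```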
I reran the test with a custom metric that outputs the full $WPT_BODIES object. The object schema is not quite the same as the HAR, but it's clear that the first item is the HTML document itself, not the cert:
{
"url": "https://52.mk/",
"type": "Document"
}
So I think we're ok.
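For anyone who wants to repeat this check on other tests, here is a minimal sketch that scans a dumped $WPT_BODIES array for the first item whose type is "Document" and reports its position. It relies only on the url and type fields shown above. The first_document helper and the "Other" type in the second sample are hypothetical, used only to illustrate the failure case.

```python
def first_document(bodies):
    """Return (index, item) for the first item of type "Document", or None."""
    for index, body in enumerate(bodies):
        if body.get("type") == "Document":
            return index, body
    return None


# Shape taken from the dump above.
bodies = [{"url": "https://52.mk/", "type": "Document"}]

# Hypothetical failure case: a non-document item listed first.
bad_bodies = [
    {"url": "http://cacerts.digitalcertvalidation.com/TrustAsiaTLSRSACA.crt", "type": "Other"},
    {"url": "https://52.mk/", "type": "Document"},
]

for sample in (bodies, bad_bodies):
    found = first_document(sample)
    position = None if found is None else found[0]
    print("first Document at index:", position)
```

An index other than 0 would mean a metric reading $WPT_BODIES[0] is looking at the wrong response.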
The waterfall and main request data are post-processed using the netlog trace events. $WPT_BODIES (and $WPT_REQUESTS) use the DevTools request details, which don't see things like OCSP checks, so the first request should (hopefully) always be the actual navigation.
Great, thanks for confirming.