HTTPArchive/data-pipeline

Summary data in `all.requests` issues

tunetheweb opened this issue

Found this out while looking at the combined pipeline issues.

The `all` pipeline has the following issues:

  • The summary data is `null` for 404s and other errors, even though these requests do have summary data in the legacy `summary_requests` table.
  • `firstReq` and `firstHtml` are not set correctly (they are always `true`).
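
The symptom can be checked directly against the published table. Here is a hedged sketch of such a query (the `summary`, `date` and `client` columns on `httparchive.all.requests`, and the example date, are assumptions about the table layout rather than verified details):

    # Hedged sketch: count requests per firstReq value. Rows with a NULL summary
    # (the failed requests) land in the NULL group, and before the fix every
    # remaining row reports firstReq = 'true'.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
    SELECT
      JSON_VALUE(summary, '$.firstReq') AS first_req,
      COUNT(*) AS requests
    FROM `httparchive.all.requests`
    WHERE date = '2022-07-01' AND client = 'desktop'
    GROUP BY first_req
    """
    for row in client.query(query).result():
        print(row.first_req, row.requests)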

This is because we call the summary code per request here:

    summary_request = None
    try:
        status_info = HarJsonToSummary.initialize_status_info(file_name, page)
        summary_request, _, _, _ = HarJsonToSummary.summarize_entry(
            request, "", "", 0, status_info
        )

That code was really intended to be called over all of a page's requests in one go, since it does this:

    first_req = False
    first_html = False
    if not first_url:
        if (400 <= status <= 599) or 12000 <= status:
            logging.warning(
                f"The first request ({url}) failed with status {status}. status_info={status_info}"
            )
            return None, None, None, None
        # This is the first URL found associated with the page - assume it's the base URL.
        first_req = True
        first_url = url
    if not first_html_url:
        # This is the first URL found associated with the page that's HTML.
        first_html = True
        first_html_url = url
    ret_request.update({"firstReq": first_req, "firstHtml": first_html})
    return ret_request, first_url, first_html_url, entry_number
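
To make the mechanism concrete, here is a toy stand-in for that bookkeeping (a hedged sketch, not the pipeline's actual function). Because the per-request call above passes `""`, `""` and `0` every time, each call believes it is seeing the page's first request, so `firstReq`/`firstHtml` come back `true` everywhere; and any request that happens to be an error hits the "first request failed" branch and returns `None`, which is where the `null` summaries come from:

    # Toy stand-in for the firstReq/firstHtml bookkeeping above - not the real
    # HarJsonToSummary code, just the state-threading behaviour it relies on.
    def summarize_entry_state(url, status, first_url, first_html_url):
        first_req = False
        first_html = False
        if not first_url:
            if (400 <= status <= 599) or 12000 <= status:
                # First request failed: the real code returns None here.
                return None, first_url, first_html_url
            first_req = True
            first_url = url
        if not first_html_url:
            first_html = True
            first_html_url = url
        return {"url": url, "firstReq": first_req, "firstHtml": first_html}, first_url, first_html_url

    requests = [
        ("https://example.com/", 200),
        ("https://example.com/app.js", 200),
        ("https://example.com/missing.png", 404),
    ]

    # Per-request calls with the state reset each time (what the all pipeline does):
    # every successful request is flagged as firstReq/firstHtml, and the 404 is dropped.
    for url, status in requests:
        row, _, _ = summarize_entry_state(url, status, "", "")
        print("per-request:", url, row)

    # Calls that thread the state through (what the summary code expects):
    # only the base URL is flagged, and the 404 still gets a summary row.
    first_url = first_html_url = ""
    for url, status in requests:
        row, first_url, first_html_url = summarize_entry_state(url, status, first_url, first_html_url)
        print("threaded:   ", url, row)

Run over a page with a couple of successful requests and a 404, the per-request loop flags every row as the first request and drops the 404 entirely, while the threaded loop behaves as intended.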

You basically need to generate the whole page and all requests in one go, and then look up this `summary_requests` array for each request:

    try:
        _, requests = HarJsonToSummary.generate_pages(file_name, har)
    except Exception:
        logging.exception(
            f"Unable to unpack HAR, check previous logs for detailed errors. "
            f"{file_name=}, {har=}"
        )
        return None

    summary_requests = []

    for request in requests:

        try:
            wanted_summary_fields = [
                field["name"]
                for field in constants.BIGQUERY["schemas"]["summary_requests"]["fields"]
            ]

            request = utils.dict_subset(request, wanted_summary_fields)
        except Exception:
            logging.exception(
                f"Unable to unpack HAR, check previous logs for detailed errors. "
                f"{file_name=}, {har=}"
            )
            continue

        if request:
            summary_requests.append(request)
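
The remaining step, not shown above, is attaching each summary row back to the matching request before it is written out. A minimal sketch of one way to do that lookup, assuming each summary row retains a `url` field, that the URL is a usable key (the real pipeline might key on the entry index instead, since a page can fetch the same URL more than once), and that `request_row` is the per-request dict destined for `all.requests`:

    import json

    # Hedged sketch, not the actual pipeline code: index the summary rows by URL
    # so each request row can pick up its own summary.
    summary_by_url = {}
    for summary in summary_requests:
        # keep the first summary seen for a URL; duplicate URLs would need the index
        summary_by_url.setdefault(summary["url"], summary)

    def attach_summary(request_row):
        # request_row: the per-request dict about to be written to all.requests
        summary = summary_by_url.get(request_row["url"])
        request_row["summary"] = json.dumps(summary) if summary else None
        return request_row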

This should be fixed in the streaming writes from the agent for the next crawl: HTTPArchive/wptagent@53189db

This looks good now in the new dataset 🎉