HTTPArchive/data-pipeline

Summary data in `all.requests` issues

tunetheweb opened this issue

Found this out while looking at the combined pipeline issues.

The `all` pipeline has the following issues:

  • The summary data is `null` for 404s and other errors, even though these requests do have summary data in the legacy `summary_requests` table.
  • `firstReq` and `firstHtml` are not set correctly (they are always `true`).
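
The symptom can be checked directly against the published table. Here is a hedged sketch of such a query (the `summary`, `date` and `client` columns on `httparchive.all.requests`, and the example date, are assumptions about the table layout rather than verified details):

    # Hedged sketch: count requests per firstReq value. Rows with a NULL summary
    # (the failed requests) land in the NULL group, and before the fix every
    # remaining row reports firstReq = 'true'.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
    SELECT
      JSON_VALUE(summary, '$.firstReq') AS first_req,
      COUNT(*) AS requests
    FROM `httparchive.all.requests`
    WHERE date = '2022-07-01' AND client = 'desktop'
    GROUP BY first_req
    """
    for row in client.query(query).result():
        print(row.first_req, row.requests)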

This is because we call the summary code per request here:

    summary_request = None
    try:
        status_info = HarJsonToSummary.initialize_status_info(file_name, page)
        summary_request, _, _, _ = HarJsonToSummary.summarize_entry(
            request, "", "", 0, status_info
        )

That code was really intended to be called over all of a page's requests in one go, since it does this:

    first_req = False
    first_html = False
    if not first_url:
        if (400 <= status <= 599) or 12000 <= status:
            logging.warning(
                f"The first request ({url}) failed with status {status}. status_info={status_info}"
            )
            return None, None, None, None
        # This is the first URL found associated with the page - assume it's the base URL.
        first_req = True
        first_url = url
    if not first_html_url:
        # This is the first URL found associated with the page that's HTML.
        first_html = True
        first_html_url = url
    ret_request.update({"firstReq": first_req, "firstHtml": first_html})
    return ret_request, first_url, first_html_url, entry_number
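
To make the mechanism concrete, here is a toy stand-in for that bookkeeping (a hedged sketch, not the pipeline's actual function). Because the per-request call above passes `""`, `""` and `0` every time, each call believes it is seeing the page's first request, so `firstReq`/`firstHtml` come back `true` everywhere; and any request that happens to be an error hits the "first request failed" branch and returns `None`, which is where the `null` summaries come from:

    # Toy stand-in for the firstReq/firstHtml bookkeeping above - not the real
    # HarJsonToSummary code, just the state-threading behaviour it relies on.
    def summarize_entry_state(url, status, first_url, first_html_url):
        first_req = False
        first_html = False
        if not first_url:
            if (400 <= status <= 599) or 12000 <= status:
                # First request failed: the real code returns None here.
                return None, first_url, first_html_url
            first_req = True
            first_url = url
        if not first_html_url:
            first_html = True
            first_html_url = url
        return {"url": url, "firstReq": first_req, "firstHtml": first_html}, first_url, first_html_url

    requests = [
        ("https://example.com/", 200),
        ("https://example.com/app.js", 200),
        ("https://example.com/missing.png", 404),
    ]

    # Per-request calls with the state reset each time (what the all pipeline does):
    # every successful request is flagged as firstReq/firstHtml, and the 404 is dropped.
    for url, status in requests:
        row, _, _ = summarize_entry_state(url, status, "", "")
        print("per-request:", url, row)

    # Calls that thread the state through (what the summary code expects):
    # only the base URL is flagged, and the 404 still gets a summary row.
    first_url = first_html_url = ""
    for url, status in requests:
        row, first_url, first_html_url = summarize_entry_state(url, status, first_url, first_html_url)
        print("threaded:   ", url, row)

Run over a page with a couple of successful requests and a 404, the per-request loop flags every row as the first request and drops the 404 entirely, while the threaded loop behaves as intended.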

You basically need to generate the whole page and all requests in one go, and then look up this `summary_requests` array for each request:

    try:
        _, requests = HarJsonToSummary.generate_pages(file_name, har)
    except Exception:
        logging.exception(
            f"Unable to unpack HAR, check previous logs for detailed errors. "
            f"{file_name=}, {har=}"
        )
        return None

    summary_requests = []

    for request in requests:

        try:
            wanted_summary_fields = [
                field["name"]
                for field in constants.BIGQUERY["schemas"]["summary_requests"]["fields"]
            ]

            request = utils.dict_subset(request, wanted_summary_fields)
        except Exception:
            logging.exception(
                f"Unable to unpack HAR, check previous logs for detailed errors. "
                f"{file_name=}, {har=}"
            )
            continue

        if request:
            summary_requests.append(request)
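
The remaining step, not shown above, is attaching each summary row back to the matching request before it is written out. A minimal sketch of one way to do that lookup, assuming each summary row retains a `url` field, that the URL is a usable key (the real pipeline might key on the entry index instead, since a page can fetch the same URL more than once), and that `request_row` is the per-request dict destined for `all.requests`:

    import json

    # Hedged sketch, not the actual pipeline code: index the summary rows by URL
    # so each request row can pick up its own summary.
    summary_by_url = {}
    for summary in summary_requests:
        # keep the first summary seen for a URL; duplicate URLs would need the index
        summary_by_url.setdefault(summary["url"], summary)

    def attach_summary(request_row):
        # request_row: the per-request dict about to be written to all.requests
        summary = summary_by_url.get(request_row["url"])
        request_row["summary"] = json.dumps(summary) if summary else None
        return request_row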

This should be fixed in the streaming writes from the agent for the next crawl: HTTPArchive/wptagent@53189db

This looks good now in the new dataset 🎉