Summary data in `all.requests` issues
tunetheweb opened this issue · 2 comments
Found this out while looking at the combined pipeline issues.
The `all` pipeline has the following issues:
- It's `null` for 404s and other errors, even though these have summary data in the legacy `summary_requests` table.
- It doesn't set `firstReq` and `firstHtml` correctly (they are always set to `true`).
This is because we call the summary code per request here:
data-pipeline/modules/import_all.py
Lines 341 to 346 in d047906
And that code was more intended to be called in one go since it does this:
data-pipeline/modules/transformation.py
Lines 406 to 425 in d047906
You basically need to generate the whole page and all requests, and then look up each request in this `summary_requests` array:
```python
try:
    _, requests = HarJsonToSummary.generate_pages(file_name, har)
except Exception:
    logging.exception(
        f"Unable to unpack HAR, check previous logs for detailed errors. "
        f"{file_name=}, {har=}"
    )
    return None

summary_requests = []

for request in requests:
    try:
        wanted_summary_fields = [
            field["name"]
            for field in constants.BIGQUERY["schemas"]["summary_requests"]["fields"]
        ]
        request = utils.dict_subset(request, wanted_summary_fields)
    except Exception:
        logging.exception(
            f"Unable to unpack HAR, check previous logs for detailed errors. "
            f"{file_name=}, {har=}"
        )
        continue

    if request:
        summary_requests.append(request)
```

This should be fixed in the streaming writes from the agent for the next crawl: HTTPArchive/wptagent@53189db
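For anyone reading this later, here is a rough, self-contained sketch of the difference between summarising per request and summarising the whole page once and then looking each request up. Everything in it is hypothetical: `summarize_requests`, the `index` key and the entry fields are stand-ins made up for illustration, not the real `HarJsonToSummary` code, and the way the flags are derived is an assumption.

```python
def summarize_requests(entries):
    """Hypothetical stand-in for a whole-page summary step.

    Assumes firstReq marks the first request in the list and firstHtml
    marks the first HTML response; the real pipeline may differ.
    """
    summaries = []
    seen_first_html = False
    for index, entry in enumerate(entries):
        is_html = "html" in entry.get("mime_type", "")
        first_html = is_html and not seen_first_html
        seen_first_html = seen_first_html or is_html
        summaries.append(
            {
                "index": index,
                "url": entry["url"],
                "status": entry["status"],
                "firstReq": index == 0,
                "firstHtml": first_html,
            }
        )
    return summaries


# Made-up requests for one page: a redirect, the document, and a 404.
entries = [
    {"url": "http://example.com/", "status": 301, "mime_type": ""},
    {"url": "https://example.com/", "status": 200, "mime_type": "text/html"},
    {"url": "https://example.com/missing.js", "status": 404, "mime_type": "text/html"},
]

# Per-request calls: each call sees a one-item list, so every request
# looks like the first request on the page.
per_request = [summarize_requests([entry])[0] for entry in entries]

# Whole-page call, then a lookup per request by its position.
by_index = {summary["index"]: summary for summary in summarize_requests(entries)}
whole_page = [by_index.get(index) for index in range(len(entries))]

print([summary["firstReq"] for summary in per_request])
print([summary["firstReq"] for summary in whole_page])
```

In this sketch the per-request path reports `firstReq` as true for every request, while the whole-page path only flags the first one, and the 404 still gets a summary row. Using `.get()` for the lookup leaves room for requests that the summary step dropped.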
This looks good now in the new dataset 🎉