HTTPArchive/data-pipeline

Populate requestid field

giancarloaf opened this issue · 0 comments

For the request ID, we could simply split the 64-bit integer space into bit fields and use the high 32 bits for the page ID and the low 32 bits for the request number within a given page. That way we can have about 4 billion pages (2^32), each with about 4 billion requests, and the IDs are still guaranteed to be unique.

Request IDs are entirely useless since there's nothing to join them with, so returning null wouldn't break anything.
Edit: I should add that they might be useful for distinguishing between repeated requests on the same page. But I think a simple array index would be more useful, since it would also show which request came first.

Was about to say. We should always have a unique ID, even if it's a combination of fields (pageid + url is not unique, since the same URL can be requested more than once on a page). Happy if we have an array index for all requests within a page.

requestid is currently left null in the new pipeline. I have no preference, but @pmeenan's suggestion should be possible, something like:

page_id = 123
request_num = 456
request_id = (page_id << 32) + request_num
print(request_id)
# 528280977864

Originally posted by @giancarloaf in #9 (comment)
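
A minimal sketch of the bit-field scheme discussed above, assuming the page ID and the per-page request index are both non-negative and fit in 32 bits. The names make_request_id and split_request_id are hypothetical and are not part of the pipeline; the example URLs are placeholders.

```python
REQUEST_BITS = 32
MAX_32 = (1 << 32) - 1  # largest value that fits in 32 bits


def make_request_id(page_id: int, request_num: int) -> int:
    """Pack page_id into the high 32 bits and request_num into the low 32 bits."""
    if not 0 <= page_id <= MAX_32:
        raise ValueError(f"page_id does not fit in 32 bits: {page_id}")
    if not 0 <= request_num <= MAX_32:
        raise ValueError(f"request_num does not fit in 32 bits: {request_num}")
    return (page_id << REQUEST_BITS) | request_num


def split_request_id(request_id: int) -> tuple:
    """Recover (page_id, request_num) from a packed request ID."""
    return request_id >> REQUEST_BITS, request_id & MAX_32


# Hypothetical page with a repeated URL: the array index keeps the IDs distinct.
urls = [
    "https://example.com/",
    "https://example.com/a.js",
    "https://example.com/a.js",
]
ids = [make_request_id(123, index) for index, _ in enumerate(urls)]

print(make_request_id(123, 456))        # 528280977864
print(split_request_id(528280977864))   # (123, 456)
print(len(set(ids)) == len(urls))       # True
```

Using the request's array index as the request number keeps repeated requests to the same URL distinct, and sorting by the packed ID preserves the order within a page. One caveat: if the column is a signed 64-bit integer, page IDs of 2^31 or above would overflow the positive range, so the usable page ID space would be 31 bits rather than 32.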