HTTPArchive/data-pipeline

Combine Dataflow pipelines

Closed · 1 comment

For context, we have an existing Dataflow pipeline (bigquery_import.py) that generates the pages, requests, lighthouse, and response_bodies tables. This pipeline runs at the end of the crawl and processes all HARs at once. In #16 we're creating a new Dataflow pipeline to generate the summary_pages and summary_requests tables. This pipeline runs incrementally whenever a HAR is available for processing, even while other pages are still being tested.

Combine these two Dataflow pipelines so that we have a single streaming pipeline that generates the data for all tables one page (HAR) at a time.

We're also developing the all dataset pipeline in #75. Ideally, all of these BigQuery writes would happen in the same pipeline.
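
A rough sketch of what the combined pipeline could look like, using the Beam Python SDK: a single streaming job reads one HAR at a time and fans out to a BigQuery write per table. The Pub/Sub subscription, project/table names, and the har_to_* transform functions are placeholders for illustration, not the actual implementation (which would reuse the logic from bigquery_import.py and the summary pipeline).

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


# Placeholder transforms: in practice these would be the existing HAR-to-row
# logic from bigquery_import.py and the summary pipeline.
def har_to_page_rows(har):
    yield {"url": har.get("log", {}).get("pages", [{}])[0].get("title")}


def har_to_summary_page_rows(har):
    yield {"url": har.get("log", {}).get("pages", [{}])[0].get("title")}


def run():
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        # One HAR per Pub/Sub message, published as each page finishes testing.
        hars = (
            p
            | "ReadHARs" >> beam.io.ReadFromPubSub(
                subscription="projects/<project>/subscriptions/<har-subscription>")
            | "ParseJSON" >> beam.Map(json.loads)
        )

        # Fan out: rows for every table are derived from the same parsed HAR.
        (hars
         | "ToPages" >> beam.FlatMap(har_to_page_rows)
         | "WritePages" >> beam.io.WriteToBigQuery(
             "<project>:httparchive.pages",
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

        (hars
         | "ToSummaryPages" >> beam.FlatMap(har_to_summary_page_rows)
         | "WriteSummaryPages" >> beam.io.WriteToBigQuery(
             "<project>:httparchive.summary_pages",
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == "__main__":
    run()
```

The remaining tables (requests, lighthouse, response_bodies, summary_requests, and the all dataset writes from #75) would be additional branches off the same parsed-HAR collection, so every table is populated one page at a time from a single job.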