HTTPArchive/data-pipeline

Combine Dataflow pipelines

Closed · 1 comment

For context, we have an existing Dataflow pipeline (bigquery_import.py) that generates the pages, requests, lighthouse, and response_bodies tables. This pipeline runs at the end of the crawl and processes all HARs at once. In #16 we're creating a new Dataflow pipeline to generate the summary_pages and summary_requests tables. This pipeline runs incrementally whenever a HAR is available for processing, even while other pages are still being tested.

Combine these two Dataflow pipelines so that we have a single streaming pipeline that generates the data for all tables one page (HAR) at a time.

We're also developing the all dataset pipeline in #75. Ideally, all of these BigQuery writes would happen in the same pipeline.
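
A rough sketch of what the combined pipeline could look like, using the Beam Python SDK: a single streaming job reads one HAR at a time and fans out to a BigQuery write per table. The Pub/Sub subscription, project/table names, and the har_to_* transform functions are placeholders for illustration, not the actual implementation (which would reuse the logic from bigquery_import.py and the summary pipeline).

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


# Placeholder transforms: in practice these would be the existing HAR-to-row
# logic from bigquery_import.py and the summary pipeline.
def har_to_page_rows(har):
    yield {"url": har.get("log", {}).get("pages", [{}])[0].get("title")}


def har_to_summary_page_rows(har):
    yield {"url": har.get("log", {}).get("pages", [{}])[0].get("title")}


def run():
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        # One HAR per Pub/Sub message, published as each page finishes testing.
        hars = (
            p
            | "ReadHARs" >> beam.io.ReadFromPubSub(
                subscription="projects/<project>/subscriptions/<har-subscription>")
            | "ParseJSON" >> beam.Map(json.loads)
        )

        # Fan out: rows for every table are derived from the same parsed HAR.
        (hars
         | "ToPages" >> beam.FlatMap(har_to_page_rows)
         | "WritePages" >> beam.io.WriteToBigQuery(
             "<project>:httparchive.pages",
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

        (hars
         | "ToSummaryPages" >> beam.FlatMap(har_to_summary_page_rows)
         | "WriteSummaryPages" >> beam.io.WriteToBigQuery(
             "<project>:httparchive.summary_pages",
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == "__main__":
    run()
```

The remaining tables (requests, lighthouse, response_bodies, summary_requests, and the all dataset writes from #75) would be additional branches off the same parsed-HAR collection, so every table is populated one page at a time from a single job.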