Combine Dataflow pipelines
Closed this issue · 1 comment
rviscomi commented
For context, we have an existing Dataflow pipeline (bigquery_import.py) that generates the pages, requests, lighthouse, and response_bodies tables. This pipeline runs at the end of the crawl and processes all HARs at once. In #16 we're creating a new Dataflow pipeline to generate the summary_pages and summary_requests tables. This pipeline runs incrementally whenever a HAR is available for processing, even while other pages are still being tested.
Combine these two Dataflow pipelines so that we have a single streaming pipeline that generates the data for all tables one page (HAR) at a time.
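In Apache Beam terms, the combined pipeline would process each HAR as it arrives and fan it out to rows for every output table (e.g. via a single `ParDo` with tagged outputs). Below is a minimal plain-Python sketch of that fan-out logic only — it is not the real `bigquery_import.py` code, and all field names and row shapes are illustrative assumptions about the HAR structure:

```python
def process_har(har):
    """Yield (table_name, row) pairs for one HAR.

    In the combined streaming pipeline, this per-HAR fan-out would run
    inside a single DoFn so that all tables (pages, requests,
    summary_pages, summary_requests, etc.) are fed one page at a time.
    Field names below are assumed, not taken from the actual schema.
    """
    page = har["log"]["pages"][0]
    # Full and summary page rows from the same HAR, in one pass.
    yield ("pages", {"url": page["id"], "title": page.get("title", "")})
    yield ("summary_pages", {"url": page["id"]})
    # One full and one summary row per request entry.
    for entry in har["log"]["entries"]:
        req_url = entry["request"]["url"]
        yield ("requests", {"page": page["id"], "url": req_url})
        yield ("summary_requests", {"url": req_url})
```

Each `(table_name, row)` pair would then be routed to the matching BigQuery streaming insert, so no step has to wait for the whole crawl to finish.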