HTTPArchive/data-pipeline

Run November BigQuery pipeline

rviscomi opened this issue · 9 comments

The November crawl is done and ready to be ingested into BigQuery.

@giancarloaf is working on automating the new pipeline to run as soon as the crawl is finished, but for now we need to run it manually.

Any update on this? It takes more effort to fix up once we move into the next month, so it would be good to have this completed before month end.

Ref: HTTPArchive/httparchive.org#669

The job to update the "all" dataset is done. The job to create the other tables for mobile is currently running. When that's done I'll start the desktop job.

I created this PR to update the README with a sample of the commands to run. Here's how to run the desktop job, for example:

./run_flex_template.sh combined --parameters input_file=gs://httparchive/crawls_manifest/chrome-Nov_1_2022.txt

The txt file is the crawl manifest (a list of HAR files), which optimizes the Dataflow job. Without it, Beam wastes hours listing HARs. Instructions to generate the manifest files are in the README.
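As a rough sketch of how the command above is put together, here is a small helper that prints the invocation for a given manifest file name. `build_combined_cmd` is a hypothetical name, not something in the repo, and the only manifest name confirmed in this thread is `chrome-Nov_1_2022.txt`; any other name passed in is an assumption on the caller's side.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): prints the run_flex_template.sh
# invocation for the "combined" pipeline, given a manifest file name.
# The bucket path is copied from the command quoted in this thread.
build_combined_cmd() {
  local manifest="$1"
  printf './run_flex_template.sh combined --parameters input_file=gs://httparchive/crawls_manifest/%s\n' "$manifest"
}

# Print the desktop command for the November crawl (dry run; nothing is launched).
build_combined_cmd "chrome-Nov_1_2022.txt"
```

Printing instead of executing makes it easy to eyeball the manifest path before starting a job that runs for a day or more.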

I'll keep this issue open until the desktop job is done.

The mobile job failed with what appears to be a transient error. I'm currently running the desktop job and expect it to complete within the next day. After that, I'll retry the mobile job.

Once November is up to date, we also need to start the December jobs. I've already generated the manifest files for desktop and mobile.

Desktop is done. @giancarloaf will run the mobile job as a test of the new workflow.

For the November mobile job:

  • workflow is running here (link)
  • dataflow job is running here (link)

The "all" pipeline job was skipped as expected (data already available in BigQuery), so only the "combined" pipeline will run.

Added documentation on triggering the pipeline via Pub/Sub to the flex_template branch in 01dc4fa.

The November mobile job failed again, so more investigation is needed.

After investigation, it looks like there's some kind of encoding error when the data is being prepared on GCS for import into BigQuery. apache/beam#22312 in version 2.43 of Beam seems to address the issue, so we've upgraded Beam and are currently rerunning the job. It's expected to run for another day. So far so good, no errors.
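For anyone rerunning this later, a quick guard that the environment is on a new enough Beam might look like the sketch below. `version_ge` is a hypothetical helper, and the `installed` value is hard-coded here for illustration; in practice it would come from something like `python -c 'import apache_beam; print(apache_beam.__version__)'`. Where the pipeline actually pins its Beam version is not covered in this thread.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Illustrative value only; replace with the version reported by the environment.
installed="2.43.0"
if version_ge "$installed" "2.43.0"; then
  echo "Beam $installed is at or above 2.43.0"
else
  echo "Beam $installed is older than 2.43.0" >&2
fi
```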

It worked! Closing this out.