HTTPArchive/data-pipeline

Investigate why the Dataflow jobs stall on the remaining HARs

rviscomi opened this issue · 1 comments

The November mobile combined job failed today after running for 4 days.

The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. If the logs only contain generic timeout errors related to accessing external resources, such as MongoDB, verify that the worker service account has permission to access the resource's subnetwork. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:

 data-pipeline-combined-20-12122204-ji2x-harness-8v0d
   Root cause: The worker lost contact with the service

Investigate what is causing the pipeline to get stalled and eventually fail. Previous month's jobs did complete successfully, but after behaving similarly.

Issue resolved, possibly related to the Ubuntu upgrades and/or gzip filtering logic