Investigate why the Dataflow jobs stall on the remaining HARs
rviscomi opened this issue · 1 comments
The November mobile combined job failed today after running for 4 days.
The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. If the logs only contain generic timeout errors related to accessing external resources, such as MongoDB, verify that the worker service account has permission to access the resource's subnetwork. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
data-pipeline-combined-20-12122204-ji2x-harness-8v0d Root cause: The worker lost contact with the service
Investigate what is causing the pipeline to get stalled and eventually fail. Previous month's jobs did complete successfully, but after behaving similarly.
Issue resolved, possibly related to the Ubuntu upgrades and/or gzip filtering logic