Rerunning a workflow with existing BQ data doesn't skip its Dataflow job
rviscomi opened this issue · 1 comment
There was a bug in the pipeline this month where desktop data was written to `all.pages` but the corresponding combined job failed. After I restarted the pipeline, I expected it to skip the job that creates desktop data for `all.pages`, but it started it anyway.
In the workflow, it seems like it expects a `rows` field in the query results, but according to the logs there is no such field in the response.
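For reference, here's a rough Python rendering of the kind of check the workflow step presumably performs on the `rows` field (the real step lives in the workflow definition, so the function name and structure here are assumptions for illustration only):

```python
# Hypothetical Python rendering of the skip check; the real step is defined in
# the workflow, so names and structure here are assumptions.
def should_skip_dataflow_job(query_response: dict) -> bool:
    """Return True if the count query reports existing rows for this client/date."""
    rows = query_response.get("rows", [])
    if not rows:
        # When jobs.query times out, the response has no "rows" field at all,
        # so the check falls through here and the Dataflow job is not skipped.
        return False
    # A COUNT(*) query returns a single row with a single field: the count.
    count = int(rows[0]["f"][0]["v"])
    return count > 0
```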
Manually running the same query directly on BigQuery shows a result of 23132018 (the number of rows with `client = 'desktop'` and `date = '2023-09-01'`). So the `bigQueryCountAll` workflow step should definitely return a number greater than 0 and skip the Dataflow job.
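Something like the following reproduces that manual check; the exact query text the workflow runs isn't included here, so the table and column names below are taken from the description above:

```python
# Minimal sketch of the manual check, assuming the table/column names from the
# description above (httparchive.all.pages, client, date).
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT COUNT(*) AS cnt
    FROM `httparchive.all.pages`
    WHERE client = 'desktop'
      AND date = '2023-09-01'
"""
row = list(client.query(sql).result())[0]
print(row["cnt"])  # 23132018 in the manual run described above
```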
I think the issue is that the query is timing out. We're setting `timeoutMs` to 30000 (30 seconds), but the query is taking over a minute.
The query result is logged at 16:14:10, which is exactly 30 seconds after the query start time of 4:13:40 PM. The query doesn't complete until 4:14:54 PM. That also explains why the query result contained `jobComplete: false`.
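That matches how the BigQuery `jobs.query` API behaves: if the query doesn't finish within `timeoutMs`, the call still returns, but with `jobComplete: false` and no `rows` (or `totalRows`) field at all, so the workflow's row check can never pass. Illustrative response shape only (not the actual log output):

```python
# Illustrative shape of a timed-out jobs.query response (not the actual log):
timed_out_response = {
    "kind": "bigquery#queryResponse",
    "jobReference": {"projectId": "...", "jobId": "..."},
    "jobComplete": False,
    # no "rows", no "totalRows" -- so the workflow's rows check cannot pass
}
```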
IIRC we successfully tested this flow before, but queries against the table may have gotten slower as it's grown over time. To fix the issue (for now), it might be worth raising the timeout to something with more headroom, like 300000 (5 minutes).
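A rough sketch of what that mitigation could look like against the REST API (illustrative only; the workflow itself isn't Python, and the project ID and query text here are assumptions). Since `jobs.query` isn't guaranteed to wait the full `timeoutMs`, a more defensive option would be to also poll `jobs.getQueryResults` whenever `jobComplete` comes back false, rather than treating a missing `rows` field as "no data":

```python
# Sketch of the proposed mitigation; project ID and query text are assumptions.
from googleapiclient.discovery import build

bq = build("bigquery", "v2")
project = "httparchive"  # assumed project

body = {
    "query": (
        "SELECT COUNT(*) FROM `httparchive.all.pages` "
        "WHERE client = 'desktop' AND date = '2023-09-01'"
    ),
    "useLegacySql": False,
    "timeoutMs": 300000,  # was 30000; ~5 minutes of headroom as suggested above
}
response = bq.jobs().query(projectId=project, body=body).execute()

# jobs.query can still return early, so poll until the job completes instead
# of treating a missing "rows" field as zero rows.
while not response.get("jobComplete"):
    response = bq.jobs().getQueryResults(
        projectId=project,
        jobId=response["jobReference"]["jobId"],
        timeoutMs=300000,
    ).execute()

count = int(response["rows"][0]["f"][0]["v"])
```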
