viant/bqtail

Sometimes BQ load jobs fail with a File Not Found error

varelasavela opened this issue · 1 comments

Prerequisites:
Version: latest
Trigger bucket: multi-regional
Other buckets: regional
No limit on max instances of the CF
256 MB memory on the BqTail CF

  1. After checking the job runs, I saw two jobs for the same batch: the first one succeeded, and the second, which ran right after it, failed.
  2. The first batch was moved by the dispatcher with an extra 9-second delay.
  3. There was a 23-second delay between the upload and the bqtail event trigger.

This may be related to the CF autoscaler, because after several hours of running the errors no longer occur.

The issue stemmed from a CF trigger delivery delay.
It looks like, after a batch had started executing its load job, an additional GCS event for a file belonging to that batch window arrived late (22 seconds after the batch had technically ended). Because a batch uses file modification time to determine which files are allocated to it, the late event was still mapped to the window that had already run, which resulted in a second job for the same batch.

It is not ideal to have a Google Storage event delivered 20+ seconds after the original file was created, but I guess that during bursts of newly created files the CF autoscaler may not scale up quickly enough.

In any case, the bqtail batching mechanism uses a shared time-based batch marker file whose name can be reconstructed from any file belonging to a given batch. It is therefore easy to handle delayed event triggers by keeping the marker file in the bqdispatch location for longer, without delaying batch execution itself.
The following commit addresses the batch duplication caused by delayed GCS events:

353a2cd