Q: How do you handle split operations in case of failure?
ktopcuoglu opened this issue · 5 comments
I tried to follow the behaviour from the code, but I got lost. I use the Split and Transform config to append incoming data across different final tables.

- When are the Transform operations applied? As I understand it, they run when inserting into the target table from the transient table, so I can't modify transforms per split table, right? Also, do all the split tables have to have the same schema?
- When most of the Split operations succeed but one of them fails for some reason (a temporary internal error, or a permanent error such as a column type mismatch or a transform error), what is the expected behaviour?
- When using a transient table, what is the usage/role of Dest.Table?

Here is my rule:
```yaml
When:
  Prefix: /data/clickstream/
  Suffix: .json.gz
Async: true
Dest:
  Table: bqtail.dummy
  Transient:
    Dataset: temp
  Transform:
    event: lower(event)
    userid: cast(replace(userid, '"', '') as int64)
    price: safe_cast(price as float64)
    contentid: safe_cast(contentid as int64)
  Schema:
    Template: clickstream._template_table
  Split:
    ClusterColumns:
      - event
    Mapping:
      - When: event = 'pageview'
        Then: clickstream.pageview
      - When: event = 'addtobasket'
        Then: clickstream.addtobasket
      - When: event not in ('pageview','addtobasket')
        Then: clickstream.other_events
OnSuccess:
  - Action: delete
```
- Transform is applied when copying data from the transient table: it is simply a SELECT from the temp table, enriched with the transform and side-input expressions. The transform is shared between the splits, so you cannot have a dedicated transform per split, but you can always use a CASE expression to account for variation between destination tables (a sketch follows this list). See the example execution plan for a split (expect.json) in https://github.com/viant/bqtail/tree/master/stage/load/test/008_table_split
- BqTail runs each BigQuery job asynchronously, which means that if one fails, what happens depends on how the failure is classified. If it is recoverable (e.g. a 503, or a connection reset by peer), the cloud function returns an error and is automatically retried (the BqTail CF has retry set); if however you get an internal server error, the whole ingestion process is restarted. If ingestion fails, the on-success delete action never runs, so the affected data files stay in the bqtail trigger bucket. Any ingestion process that fails is listed as stalled by BqMonitor (after the grace period); in that case you can rectify the issue and replay the stalled data file or process. All that said, for a non-recoverable error you may end up with duplication, so that has to be taken care of downstream.
- Dest.Table plays no role in a split, so it can be the same as the template.
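For instance, a single shared transform can still vary its output per destination with a CASE over the split column. A minimal sketch; the derived column event_group is hypothetical and not part of the rule above:

```yaml
Dest:
  Transform:
    # one shared expression whose result differs per split destination;
    # event_group is a hypothetical column added for illustration
    event_group: case when event in ('pageview', 'addtobasket') then event else 'other' end
```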
Also, join the #bqtail slack channel on gophers.slack.com.
> Transform is applied when copying data from the transient table: it is simply a SELECT from the temp table, enriched with the transform and side-input expressions. The transform is shared between the splits, so you cannot have a dedicated transform per split, but you can always use a CASE expression to account for variation between destination tables. See the example execution plan for a split (expect.json) in https://github.com/viant/bqtail/tree/master/stage/load/test/008_table_split

Wonderful, I didn't pay attention to the expect files before.
> BqTail runs each BigQuery job asynchronously, which means that if one fails, what happens depends on how the failure is classified. If it is recoverable (e.g. a 503, or a connection reset by peer), the cloud function returns an error and is automatically retried (the BqTail CF has retry set); if however you get an internal server error, the whole ingestion process is restarted. If ingestion fails, the on-success delete action never runs, so the affected data files stay in the bqtail trigger bucket. Any ingestion process that fails is listed as stalled by BqMonitor (after the grace period); in that case you can rectify the issue and replay the stalled data file or process. All that said, for a non-recoverable error you may end up with duplication, so that has to be taken care of downstream.
Can we add EventID as a new column to the splits? Then it would be possible to remove the inserted rows from a stalled job to avoid duplication (it may require scanning the whole table, but that is not a concern on a flat-rate plan).
Also, as I understand from expect.json, the split inserts run sequentially, right? When we set async=true, will they all start at the same time (after the transient table is loaded)?
Yes, you can add EventID or any other column with the $EventID expression and run extra deduplication in the split clause. The EventID sticks to the original process; consider adding EventID as a cluster field to manage performance/cost.
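A minimal sketch of what that could look like, assuming $EventID is valid as a Transform expression; the column name event_id is hypothetical, only $EventID itself comes from the answer above:

```yaml
Dest:
  Transform:
    event_id: $EventID   # tag every ingested row with the ingestion event ID
  Split:
    ClusterColumns:
      - event_id         # cluster on event_id to keep dedup scans cheap
# to drop rows left behind by a stalled process (hypothetical query):
#   DELETE FROM clickstream.pageview WHERE event_id = <stalled event id>
```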
expect.json is an example ingestion-process execution plan for a rule, used by the test process.
Async means that the on-success or on-failure section runs only once the preceding task has completed with success or error.
In async mode the BqTail cloud function never waits for BigQuery job completion; it just submits the BQ jobs and quits. BqDispatcher actively checks for completed BigQuery jobs and notifies the BqTail process to run the post actions. Post actions can be nested without limit.
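To make the flow concrete, a hedged sketch of how the on-success/on-failure sections attach to a rule; the move action and its DestURL parameter are assumptions based on the bqtail docs, not confirmed in this thread:

```yaml
OnSuccess:
  - Action: delete             # remove the ingested data files (as in the rule above)
OnFailure:
  - Action: move               # assumed action name: park failed files for inspection
    Request:
      DestURL: gs://mybucket/bqtail/errors   # hypothetical destination bucket/path
```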
I think Dest.Table can be used as a template for the split destination tables.