Q: How do you handle split operations in case of failure?
ktopcuoglu opened this issue · 5 comments
I tried to follow the behaviour from the code, but I got lost. I use the Split and Transform config to append incoming data across different final tables.

- When are the Transform operations applied? As I understand it, they run when inserting into the target table from the transient table, so I can't modify transforms per split table, right? Also, do all the split tables have to have the same schema?
- When most of the Split operations succeed but one of them fails for some reason (a temporary internal error, or a permanent error such as a column type mismatch or a transform error), what is the expected behaviour?
- When using a transient table, what is the usage/role of Dest.Table?

Here is my rule:
```yaml
When:
  Prefix: /data/clickstream/
  Suffix: .json.gz
Async: true
Dest:
  Table: bqtail.dummy
  Transient:
    Dataset: temp
  Transform:
    event: lower(event)
    userid: cast(replace(userid, '"', '') as int64)
    price: safe_cast(price as float64)
    contentid: safe_cast(contentid as int64)
  Schema:
    Template: clickstream._template_table
  Split:
    ClusterColumns:
      - event
    Mapping:
      - When: event = 'pageview'
        Then: clickstream.pageview
      - When: event = 'addtobasket'
        Then: clickstream.addtobasket
      - When: event not in ('pageview','addtobasket')
        Then: clickstream.other_events
OnSuccess:
  - Action: delete
```
- Transform is applied when copying data from the transient table: it is simply a SELECT from the temp table, enriched with the transform and side-input expressions. The transform is shared between the splits, so you cannot have a dedicated transform per split, but you can always use a CASE expression to account for variation between destination tables (a sketch follows this list). See the example execution plan for a split (expect.json) in https://github.com/viant/bqtail/tree/master/stage/load/test/008_table_split
- BqTail runs each BigQuery job asynchronously, which means that if one fails, what happens depends on how the failure is classified. If it is recoverable (e.g. a 503, or a connection reset by peer), the cloud function returns an error and is automatically retried (the BqTail CF has retry set); if however you get an internal server error, the whole ingestion process is restarted. If ingestion fails, the on-success delete action never runs, so the affected data files stay in the bqtail trigger bucket. Any ingestion process that fails is listed as stalled by BqMonitor (after the grace period); in that case you can rectify the issue and replay the stalled data file or process. All that said, for a non-recoverable error you may end up with duplication, so that has to be taken care of downstream.
- Dest.Table plays no role in a split, so it can be the same as the template.
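For instance, a single shared transform can still vary its output per destination with a CASE over the split column. A minimal sketch; the derived column event_group is hypothetical and not part of the rule above:

```yaml
Dest:
  Transform:
    # one shared expression whose result differs per split destination;
    # event_group is a hypothetical column added for illustration
    event_group: case when event in ('pageview', 'addtobasket') then event else 'other' end
```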
Also, join the #bqtail slack channel on gophers.slack.com.
> Transform is applied when copying data from the transient table: it is simply a SELECT from the temp table, enriched with the transform and side-input expressions. The transform is shared between the splits, so you cannot have a dedicated transform per split, but you can always use a CASE expression to account for variation between destination tables. See the example execution plan for a split (expect.json) in https://github.com/viant/bqtail/tree/master/stage/load/test/008_table_split

Wonderful, I didn't pay attention to the expect files before.
> BqTail runs each BigQuery job asynchronously, which means that if one fails, what happens depends on how the failure is classified. If it is recoverable (e.g. a 503, or a connection reset by peer), the cloud function returns an error and is automatically retried (the BqTail CF has retry set); if however you get an internal server error, the whole ingestion process is restarted. If ingestion fails, the on-success delete action never runs, so the affected data files stay in the bqtail trigger bucket. Any ingestion process that fails is listed as stalled by BqMonitor (after the grace period); in that case you can rectify the issue and replay the stalled data file or process. All that said, for a non-recoverable error you may end up with duplication, so that has to be taken care of downstream.
Can we add EventID as a new column to the splits? Then it would be possible to remove the inserted rows from a stalled job to avoid duplication (it may require scanning the whole table, but that is not a concern on a flat-rate plan).
Also, as I understand from expect.json, the split inserts run sequentially, right? When we set async=true, will they all start at the same time (after the transient table is loaded)?
Yes, you can add EventID or any other column with the $EventID expression and run extra deduplication in the split clause. The EventID sticks to the original process; consider adding EventID as a cluster field to manage performance/cost.
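A minimal sketch of what that could look like, assuming $EventID is valid as a Transform expression; the column name event_id is hypothetical, only $EventID itself comes from the answer above:

```yaml
Dest:
  Transform:
    event_id: $EventID   # tag every ingested row with the ingestion event ID
  Split:
    ClusterColumns:
      - event_id         # cluster on event_id to keep dedup scans cheap
# to drop rows left behind by a stalled process (hypothetical query):
#   DELETE FROM clickstream.pageview WHERE event_id = <stalled event id>
```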
expect.json is an example ingestion-process execution plan for a rule, used by the test process.
Async means that the on-success or on-failure section runs only once the preceding task has completed with success or error.
In async mode the BqTail cloud function never waits for BigQuery job completion; it just submits the BQ jobs and quits. BqDispatcher actively checks for completed BigQuery jobs and notifies the BqTail process to run the post actions. Post actions can be nested without limit.
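To make the flow concrete, a hedged sketch of how the on-success/on-failure sections attach to a rule; the move action and its DestURL parameter are assumptions based on the bqtail docs, not confirmed in this thread:

```yaml
OnSuccess:
  - Action: delete             # remove the ingested data files (as in the rule above)
OnFailure:
  - Action: move               # assumed action name: park failed files for inspection
    Request:
      DestURL: gs://mybucket/bqtail/errors   # hypothetical destination bucket/path
```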
I think Dest.Table can be used as a template for the split destination tables.