Refresh Pipeline Ends after Extract Retries
bhtucker opened this issue · 2 comments
Summary
When the refresh pipeline runs with non-zero retry attempts for extract, failed Sqoop jobs that Arthur goes on to correct still lead the 'extract' step to show a status of FAILED. Thus, even though data may be fully extracted, Data Pipeline will cascade the failure and arthur.py update will not run.
Details
This definition of failure doesn't seem to be configurable in AWS. However, the cascading failure behavior is.
Context
Arthur ETL uses AWS Data Pipeline to orchestrate extract-load-transform steps. Extract uses EMR to dump data from source tables in parallel.
Rather than only entering a FAILED state when the submitted command exits with failure, an EMR step will be FAILED if any map-reduce job it runs fails. This means that when an Arthur extract attempt fails and is then retried successfully, the EMR step (and thus the Data Pipeline activity) is still marked FAILED.
Objective
Because this FAILED state logic is not configurable and retrying extract is often effective, we want to reformulate data pipelines so that they will still accomplish their business function without depending on that step status.
Current state
Currently, how we use that step varies by pipeline.
Rebuild pipeline
Extract is not required, as load does not depend on it (in order to run concurrently). Load checks for DynamoDB events showing successful extracts for each source table.
If required relations have failed to extract or if extract does not send events for all expected relations, load fails.
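The rule that load applies here can be sketched in a few lines. This is only an illustration of the two failure conditions described above, not Arthur's actual code: the ExtractEvent shape, the ExtractCheckError name, and the check_extracts function are all hypothetical, and the real events live in DynamoDB rather than in a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractEvent:
    """Hypothetical, simplified shape of one extract event record."""
    relation: str
    succeeded: bool


class ExtractCheckError(Exception):
    """Raised when load should not proceed (hypothetical name)."""


def check_extracts(events, expected, required):
    """Return the expected relations that are ready to load, or raise."""
    status = {}
    for event in events:
        # Any success counts, so a failed attempt that was retried
        # successfully leaves the relation ready to load.
        status[event.relation] = status.get(event.relation, False) or event.succeeded

    # Condition 1: extract must have sent events for all expected relations.
    missing = sorted(set(expected) - set(status))
    if missing:
        raise ExtractCheckError("no extract events for: " + ", ".join(missing))

    # Condition 2: every required relation must have extracted successfully.
    failed = sorted(r for r in required if not status.get(r, False))
    if failed:
        raise ExtractCheckError("required extracts failed: " + ", ".join(failed))

    return sorted(r for r in expected if status[r])


events = [
    ExtractEvent("www.orders", False),  # first attempt failed
    ExtractEvent("www.orders", True),   # retry succeeded
    ExtractEvent("www.users", False),   # never succeeded
]
ready = check_extracts(events, expected=["www.orders", "www.users"], required=["www.orders"])
```

Note that the status of the EMR step never enters into this check, which is why the rebuild pipeline is not affected by the FAILED state described under Context.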
Pizza loader pipeline
Extract is expected to have completed in advance, so there is no extract step.
If required relations do not have manifests in place, load fails.
Refresh pipeline
Extract is an outright dependency. Failing extract ends the pipeline.
Potential changes
With a couple of changes, pipelines could proceed despite failure of extract steps:
- Dependency without cascading failure: Data Pipeline dependencies control the order of execution of steps, but apparently do not necessarily cascade failures to dependencies. See https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-manage-cascade-failandrerun.html. Potentially by setting failureAndRerunMode to false or none, dependent steps will run after a step fails.
- Check DynamoDB for source extraction in update: Rather than rely on extract success to signal that all targeted sources are ready to load, arthur.py update can check for recent successful extracts. The contract of a refresh pipeline could be a 'best effort' data update (i.e., refresh all selected tables that have recent extracts) or simply enforce that all relations are extracted (like we do today).
Tradeoffs
If we make the changes proposed above (and opt for refresh to quit if any selected sources do not have recent extracts), what concerns do we have?
- how do we detect extract failures? Load/update steps, which are set to cascade failure, will exit when required extracts aren't present.
- what does the pipeline status (in AWS console data pipeline list) indicate? It will show failure for pipelines with any failed steps, so it will indicate if any extract steps required retry (or if steps outright failed).
- does this change to update make it less useful as a standalone command? Perhaps the current behavior (assume manifests on S3 are ready to load) can remain available behind a flag, or the 'check for recent extracts' can get a flag. Should be easy to serve both use cases.
Closing since a change in the refresh pipeline was implemented with PR #88
For refresh, the check against extract events is now based on (and turned on by):
--scheduled-start-time #{@scheduledStartTime}
(Without this switch, update expects manifests to exist at launch time of the command.)
For rebuild, the check is enabled by:
--concurrent-extract
We should also use the start time for rebuilds and have the same command line args. (We had some issues lately where the 15min guess turned out to cause a failure in the pipeline.)
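To illustrate why the start time is a safer cutoff than a guess, here is a small sketch. It assumes the 15min guess works as a fixed look-back window from the time of the check, which is not spelled out above. The event tuples, timestamps, and the successful_since helper are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def successful_since(events, cutoff):
    """Return relations with a successful extract event at or after cutoff."""
    return {
        relation
        for relation, finished_at, succeeded in events
        if succeeded and finished_at >= cutoff
    }


# Made-up sample data: one quick extract and one slow extract.
scheduled_start = datetime(2018, 1, 1, 6, 0, tzinfo=timezone.utc)
events = [
    ("www.orders", scheduled_start + timedelta(minutes=5), True),
    ("www.users", scheduled_start + timedelta(minutes=40), True),
]

# The check runs once the slow extract is done, 45 minutes into the pipeline.
check_time = scheduled_start + timedelta(minutes=45)

# A fixed 15-minute look-back misses the extract that finished early.
by_guess = successful_since(events, check_time - timedelta(minutes=15))

# Using the scheduled start time sees every extract from this pipeline run.
by_start = successful_since(events, scheduled_start)
```

With the look-back window, www.orders would be treated as not extracted even though it succeeded within this run, which is the kind of failure a guess can cause.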
As for the "failed" extract steps, we should specifically call this out in playbooks to avoid confusion.
Also, the check against extract events should be available as a standalone command to aid with triage or debugging.