harrystech/arthur-redshift-etl

Refresh Pipeline Ends after Extract Retries

bhtucker opened this issue · 2 comments

Summary

When the refresh pipeline runs with a non-zero number of retry attempts for extract, Sqoop jobs that fail and are then retried successfully by Arthur still leave the 'extract' step with a status of FAILED. Thus, even though the data may be fully extracted, Data Pipeline cascades the failure and arthur.py update never runs.

Details

The way AWS defines a step's failure does not seem to be configurable. The cascading-failure behavior, however, is.

Context

Arthur ETL uses AWS Data Pipeline to orchestrate extract-load-transform steps. Extract uses EMR to dump data from source tables in parallel.

Rather than entering a FAILED state only when the submitted command exits with a failure, an EMR step is FAILED if any map-reduce job it runs fails. This means that when an Arthur extract attempt fails and is then retried successfully, the EMR step (and thus the Data Pipeline activity) still ends up FAILED.

Objective

Because this FAILED-state logic is not configurable and retrying extract is often effective, we want to reformulate our data pipelines so that they still accomplish their business function without depending on the extract step's status.

Current state

Currently, how we use that step varies by pipeline.

Rebuild pipeline

The extract step is not a declared dependency: load does not depend on it, so the two can run concurrently. Instead, load checks DynamoDB for events showing a successful extract of each source table.

If required relations have failed to extract or if extract does not send events for all expected relations, load fails.

Pizza loader pipeline

Extract is expected to have completed in advance, so there is no extract step.

If required relations do not have manifests in place, load fails.

Refresh pipeline

Extract is an outright dependency. Failing extract ends the pipeline.

Potential changes

With a couple of changes, pipelines could proceed despite the failure of extract steps:

  • Dependency without cascading failure: Data Pipeline dependencies control the order in which steps execute, but they do not have to cascade failures to dependent steps. See https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-manage-cascade-failandrerun.html. By setting failureAndRerunMode to none (the alternative to the default, cascade), dependent steps should still run after a step fails.

  • Check DynamoDB for source extraction in update: Rather than relying on extract success to signal that all targeted sources are ready to load, arthur.py update can check for recent successful extracts. The contract of a refresh pipeline could then be a 'best effort' data update (i.e., refresh all selected tables that have recent extracts) or continue to require that all relations are extracted (as we do today).
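To illustrate the first change, a pipeline object could opt out of cascading roughly as below. This is a hypothetical fragment: the object ids are made up, and where exactly we would set failureAndRerunMode in our pipeline definitions is an assumption; only the field name and its values (cascade, none) come from the AWS documentation linked above.

```json
{
  "id": "UpdateActivity",
  "type": "ShellCommandActivity",
  "dependsOn": { "ref": "ExtractActivity" },
  "failureAndRerunMode": "none"
}
```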
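The DynamoDB check in the second bullet might look like the following sketch. The event-record shape (keys target, event, timestamp) and the function name are assumptions for illustration; the actual schema of Arthur's events table may differ.

```python
def latest_finish_time(events, relation):
    """
    events: iterable of event records, each a dict with (hypothetical)
    keys "target" (relation name), "event" ("start"/"finish"/"fail"),
    and "timestamp" (epoch seconds), e.g. as queried from DynamoDB.
    Returns the newest "finish" timestamp for the relation, or None
    if the relation never finished an extract.
    """
    finishes = [e["timestamp"] for e in events
                if e["target"] == relation and e["event"] == "finish"]
    return max(finishes, default=None)
```

An update step could then compare this timestamp per selected relation against some notion of "recent" and either skip stale relations (best effort) or abort (strict contract).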

Tradeoffs

If we make the changes proposed above (and opt for refresh to quit if any selected sources do not have recent extracts), what concerns do we have?

  • How do we detect extract failures? Load/update steps, which are set to cascade failure, will exit when required extracts aren't present.
  • What does the pipeline status (in the AWS console's Data Pipeline list) indicate? It will show failure for pipelines with any failed steps, so it still indicates whether any extract steps required retries (or failed outright).
  • Does this change to update make it less useful as a standalone command? Perhaps the current behavior (assume manifests on S3 are ready to load) can remain available behind a flag, or the 'check for recent extracts' can get a flag. It should be easy to serve both use cases.

Closing, since a change to the refresh pipeline was implemented in PR #88.

For refresh, the check against extract events is now based on (and turned on by):

--scheduled-start-time #{@scheduledStartTime}

(Without this switch, update expects manifests to exist at launch time of the command.)
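In other words, with the switch the update step can compare extract-event timestamps against the pipeline's scheduled start instead of assuming manifests already exist. A minimal sketch of that comparison, with a hypothetical function name and input shape:

```python
from datetime import datetime

def stale_relations(extract_events, scheduled_start):
    """
    extract_events: mapping of relation name -> datetime of its most
    recent successful extract (hypothetical shape, for illustration).
    scheduled_start: the pipeline's @scheduledStartTime as a datetime.
    Returns the relations whose last extract predates the scheduled
    start; an empty list means everything selected is ready to load.
    """
    return sorted(name for name, ts in extract_events.items()
                  if ts < scheduled_start)
```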

For rebuild, the check is enabled by:

--concurrent-extract

We should also use the start time for rebuilds and have the same command-line args. (We had some issues lately where the 15-minute guess turned out to cause a failure in the pipeline.)

As for the "failed" extract steps, we should specifically call this out in playbooks to avoid confusion.

Also, the check against extract events should be available as a standalone command to aid with triage or debugging.