event-manifest-cleaner

A Spark job that reads the records of a failed run from the enriched good directory and deletes exactly those records from DynamoDB.

Primary language: Scala. License: Apache License 2.0.

Event Manifest Cleaner

Introduction

This is an Apache Spark job that cleans up a Snowplow event manifest in DynamoDB for the result of a particular Enrich job. It solves the problem where the Shred job has partially populated the event manifest but the pipeline then failed downstream. If an engineer tries to recover the pipeline by running the Shred job again, most events will be mistakenly marked as duplicates, which leads to data loss in Redshift.
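The failure mode described above can be illustrated with a toy in-memory model. This is only a sketch: the manifest is modelled as a plain Python set keyed by a hypothetical event ID, whereas the real manifest lives in DynamoDB and its actual key structure is not described here. The function names `shred` and `clean_manifest` are invented for illustration.

```python
def shred(events, manifest):
    """Toy cross-batch deduplication: keep only events not yet in the manifest."""
    kept = []
    for event_id in events:
        if event_id in manifest:
            continue  # already recorded, so treated as a duplicate and dropped
        manifest.add(event_id)
        kept.append(event_id)
    return kept


def clean_manifest(events, manifest):
    """Remove exactly the IDs that belong to the failed run's enriched events."""
    for event_id in events:
        manifest.discard(event_id)


manifest = {"old-1"}      # left over from an earlier, successful batch
batch = ["a", "b", "c"]   # enriched events of the run that later failed

# The first Shred run records part of the batch, then the pipeline fails downstream.
shred(batch[:2], manifest)

# Naive recovery: rerunning Shred drops everything that was already recorded.
rerun_without_cleaning = shred(batch, manifest)

# Recovery with cleaning: delete the batch's IDs first, then rerun Shred.
clean_manifest(batch, manifest)
rerun_after_cleaning = shred(batch, manifest)
```

In this model the naive rerun keeps only the one event that had not been recorded yet, while the rerun after cleaning keeps the full batch and leaves the entry from the earlier batch untouched.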

Usage

A detailed use case can be found in "Recovering pipelines with cross-batch deduplication".

In order to use Event Manifest Cleaner, you need to have boto2 installed:

$ pip install boto

Now you can run Event Manifest Cleaner with a single command (from inside the event-manifest-cleaner directory):

$ python run.py run_emr $ENRICHED_EVENTS_DIR $STORAGE_CONFIG_PATH $IGLU_RESOLVER_PATH

The user running the job needs the dynamodb:DeleteTable permission on the related table.

The task has three required arguments:

  1. Path to the enriched events directory. This can be a not-yet-archived directory in enriched.good or, in rare cases, a particular directory in enriched.archive.
  2. Local path to the duplicate storage configuration JSON
  3. Local path to the Iglu resolver configuration JSON
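Because the second and third arguments are local JSON files, it can be worth checking them before starting an EMR job. The helper below is a hypothetical pre-flight check, not part of run.py: it only confirms that a path points to a file containing parseable JSON and does not validate the content against any schema.

```python
import json
import os


def check_json_config(path):
    """Return the parsed JSON at `path`, or raise ValueError with a readable message."""
    if not os.path.isfile(path):
        raise ValueError("not a file: %s" % path)
    with open(path) as f:
        try:
            return json.load(f)
        except ValueError as e:  # json.JSONDecodeError is a subclass of ValueError
            raise ValueError("invalid JSON in %s: %s" % (path, e))
```

Calling `check_json_config` on both the storage configuration path and the resolver path before invoking run.py surfaces a typo in a path or a malformed file locally, rather than after the cluster has started.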

Optionally, you can also pass the following options:

  • --time to specify the ETL time for orphan enriched events.
  • --log-path to store EMR job logs on S3. Normally, Manifest Cleaner does not produce any logs or output, but if an error occurs you will be able to inspect it in the EMR logs stored at this path.
  • --profile to specify the AWS profile used to create this EMR job.
  • --jar to specify the S3 path to a custom JAR.
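The required arguments and optional flags listed above can be assembled programmatically, for example when the cleaner is driven from a recovery script. The sketch below is an assumption-laden illustration: `build_command` is a hypothetical helper, it assumes run.py accepts the options after the positional arguments, and it passes every option value through unchanged because the expected value formats (for instance for --time) are not specified here.

```python
def build_command(enriched_dir, storage_config, resolver,
                  time=None, log_path=None, profile=None, jar=None):
    """Build the argument list for `python run.py run_emr ...`."""
    cmd = ["python", "run.py", "run_emr", enriched_dir, storage_config, resolver]
    optional = (
        ("--time", time),
        ("--log-path", log_path),
        ("--profile", profile),
        ("--jar", jar),
    )
    for flag, value in optional:
        if value is not None:  # only emit the flags that were actually set
            cmd += [flag, value]
    return cmd

# The resulting list can be handed to subprocess.call(cmd, cwd=...),
# with cwd pointing at the event-manifest-cleaner directory.
```

Building a list rather than a single shell string avoids quoting problems when a path contains spaces or other shell-sensitive characters.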

Building

Assuming git, Vagrant and VirtualBox are installed:

host$ git clone https://github.com/snowplow/event-manifest-cleaner
host$ cd event-manifest-cleaner
host$ vagrant up && vagrant ssh
guest$ cd /vagrant
guest$ sbt assembly

Copyright and License

Copyright 2017 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.