/swarm-scan

dynamodb scanning made faster, with more bees!

Primary LanguageJavaScript

Swarm Scan

A distributed DynamoDB parallel scanner

This repo is a one-size fits most solution for data extraction, transform, and load (ETL) tasks for data stored inside AWS DynamoDB tables with on-demand provisioning enabled*. If you have suggestions, feel free to open up an issue, or better yet, a pull request.

Configuration and Usage

Usage is pretty simple, but requires a teeny bit of setup. Firstly, ensure you have the serverless package installed and configured for your deployment target. You can find everything you need here on their installation documentation.

Second, make sure you have your AWS account number handy as it's required for url and arn construction. Set that to an environment variable AWS_ACCOUNT_NUMBER.

Third, update the defaults! There are is a reasonable default value for limit, and aws region is defaulted to us-east-1, but in serverless.yml there's a section for the launcher function that is set to trigger based on a timed task. You can create multiple entries in the events schedule section, each with a different table. You can (and probable should) set it up on a cron formatted time schedule, or base it on some other event.

Lastly, write your own function for scanner.js. This is a demo, more or less, and doSomethingWithData doesn't do diddly until you give it something to do.

Doing the thing:

serverless deploy [--stage][--limit][--region]

Command Line Options

Flag Type Required Description Default
--stage string false the stage, or "environment" in which this will run, e.g. staging, prod development
--limit integer false the estimated maximum number of records for each scan worker to handle 1000
--region string false the AWS Region to deploy this function us-east-1

What it does

This is only a template for a serverless-deployed AWS Lambda application that will conduct a parallel scan on your chosen DynamoDB tables. There are two Lambda functions, the first, (launcher.js,) retrieves details about your table for a rough item count, divides that number by your configured limit option (approximate max number of items you would like each scan worker to read), and writes messages to SQS with configuration information for the scan workers. The second function, (scanner.js,) is the scan worker itself. It is configured in serverless.yml to recieve a single SQS message at a time, which contains information related to the scan operation.

DynamoDB allows for parallel scans by providing parameters for TotalSegments and (this, the current) Segment options. The actual number of records each worker will scan is determined by DynamoDB's api, and only loosely controlled by the limit option.

* please note that there are still limitations as to how quickly DynamoDB can scale up read provisions, even with on-demand provisioning. You should test this before you let this thing go ham!