mapbox/dynamodb-replicator

Documentation: real world user documentation

Opened this issue · 8 comments

would it be possible to get some real world complete setup examples for using this tool?

Before I go further - I appreciate the hard work involved in getting this tool this far, and don't take my critical comments below in the wrong context. I'm trying to help improve the user experience for possibly the best and only complete replication and backup/restore tool for dynamo that exists today!! :)

The current documentation is just a marketing gloss - a user doesn't even have a feature-to-tool map. Where is the "replicator function that processes events from a DynamoDB stream"? In what file? I have to go become an expert in node, lambda, ddb streams, and streambot to be able to consume this project.

A more complete walkthrough (even if it's only an example setup) would be great - aws cli, aws web console, just something to help a user understand all of the pieces required (and to know how to skip the pieces that are not required).

How do I set up and configure DDB streams for the purpose of using this tool?

How do I set up lambda?

How does IAM apply to the tool and related services?

How does s3 relate to the tool? (Are there specific bucket setup requirements for this usage?)

Other areas of concern - How does using the tool for replication, and for backups, impact ddb scaling in various scenarios?

thanks - I'd love to use it, but figuring out how to use this is going to be a huge undertaking and trial-and-error exercise.

Even just a high level sketch of the pieces, without tons of detail, would be a great starting point for us to contribute to.

Hi! Thanks for the input. The documentation is definitely not amazing and setup is non-trivial. Did you see https://github.com/mapbox/dynamodb-replicator/blob/master/DESIGN.md? Does this answer any of your questions and/or help focus them to be a little more specific?

I have read DESIGN.md, but it is still totally unclear how to deploy all the pieces.

I concur - I was able to get it up and running, but not without some friction.

This is what I had to do:

First: Create a trigger on the table you want to replicate. For the lambda function code:

git clone https://github.com/mapbox/dynamodb-replicator
cd dynamodb-replicator
npm install
  • Then zip up the contents of the dynamodb-replicator folder and use that as your lambda function code.
  • For the handler specify: index.streambotReplicate (see the sketch below for scripting this step)
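If you'd rather script that last step than click through the console, here's a minimal sketch using the aws-sdk. The function name, role ARN, and zip path are placeholders I made up; only the handler value comes from this repo:

var AWS = require('aws-sdk');
var fs = require('fs');

var lambda = new AWS.Lambda({ region: 'us-west-2' });

lambda.createFunction({
  FunctionName: 'my-replicator',                // hypothetical function name
  Runtime: 'nodejs4.3',
  Role: 'arn:aws:iam::123456789012:role/my-replicator-role', // hypothetical execution role
  Handler: 'index.streambotReplicate',          // handler from the step above
  Code: { ZipFile: fs.readFileSync('dynamodb-replicator.zip') },
  Timeout: 60,
  MemorySize: 128
}, function (err, data) {
  if (err) throw err;
  console.log('Created %s', data.FunctionArn);
});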

Second: Set up streambot

  • This has to be setup in each region where you have a dynamodb "master" table to replicate
  • It wasn't immediately clear this was necessary, and may not totally be, but seems to work the best.
  • You can think of streambot as "middleware" that sets the process.env variables within your lambda. It does this by maintaining the configuration in a dynamodb table.
  • mapbox has a cloudformation template to set up the necessary bits
  • To build the template file for the CloudFormation stack:
git clone https://github.com/mapbox/streambot
cd streambot
npm install
npm run-script build
  • Go to CloudFormation and click Create Stack
  • Upload the template file from the output above: ./cloudformation/streambot.template
  • For the gitsha parameter, grab the latest gitsha from the streambot release.
    cf7729fbfffec9796f2b2aa063d2a3e871ba626d was the latest at the time I did this.
  • Once the cloud formation has completed, you'll have a streambot-env dynamodb table. You'll need to add a record to that table for each dynamodb table you want to replicate (I just used the web console; a scripted example follows below). The primary partition key ("name") is the name of your lambda function. The record needs an env property whose value is a JSON string, something like: {"ReplicaRegion":"us-east-1","ReplicaTable":"MyDynoTable"}
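For reference, a minimal sketch of writing that record with the aws-sdk instead of the web console; the region, function name, and env values here are just examples:

var AWS = require('aws-sdk');

var dynamodb = new AWS.DynamoDB({ region: 'us-west-2' });

dynamodb.putItem({
  TableName: 'streambot-env',                  // table created by the streambot stack
  Item: {
    name: { S: 'my-replicator' },              // must match your lambda function's name
    env: { S: JSON.stringify({ ReplicaRegion: 'us-east-1', ReplicaTable: 'MyDynoTable' }) }
  }
}, function (err) {
  if (err) throw err;
  console.log('streambot-env record written');
});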

Perhaps there is a more snappy, sexy way to get this up and running, but I didn't discover it. Ultimately if you're having to dive into the code and make changes (as I started to do before I understood what streambot was doing), you're probably doing something wrong. Also, I couldn't get the non-streambot version to work (i.e. just using index.replicate as the handler).

@brendonparker thank you for digging in here! A couple of notes:

This has to be setup in each region where you have a dynamodb "master" table to replicate. It wasn't immediately clear this was necessary, and may not totally be, but seems to work the best.

Yes, you need streambot to run in every region where there will be a lambda function pushing data from your primary table to your replica.

You'll need to add records to that table for each dynamodb table you want to replicate.

This is one way to do it. I would advise using a CloudFormation template to deploy the Lambda function that will be performing the replication. Aside from the table that streambot creates, it also creates a Lambda function that can be used to power a custom CloudFormation resource. By adding a resource like https://github.com/mapbox/streambot#streambotenv to your template, Streambot's lambda function will update the streambot-env table for you.

Perhaps there is a more snappy, sexy way to get this up and running, but I didn't discover it.

You've pretty much landed on the state-of-the-art. Would you be interested in converting your comment above into a PR for a setup doc in this repo?

Thanks for the clarifications/validations.

I was just about to spend some time digging into the cloudformation/streambot lambda to see if I could wrap my head around that.

BTW - Didn't mean to imply that it wasn't snappy, sexy (the cloudformation template was pretty sexy 😁). I just wasn't sure that I was doing it the right way.

Once I play with it a little more and have a better understanding I'll try to condense my notes down to a more formal step-by-step setup.

I'm back on this again, since snapshot-based backups don't work well with autoscaling. I'm going to keep a rough track of what I discover along the way.

Running backup-table locally is pretty straightforward:

npm install
AWS_PROFILE=profile bin/backup-table.js us-west-2/dsr-test s3://dsr-ddb-rep-testing

Ok, quick and dirty for the moment to help anyone who is trying to get this running without streambot. I hope to get some PRs and docs together, but in case work happens before I get a chance:

Note that I have not tested the replication functionality at all and am only providing it below as a theory.

Add a new function that mimics how streambot executes the core functions - add this to index.js; I dropped it below the other module.exports myself. (You could probably do this by adding a separate file instead - as I said, quick and dirty.)

module.exports.lambdaenvReplicate = LambdaEnv(replicate);
module.exports.lambdaenvBackup = LambdaEnv(incrementalBackup);

var fs = require('fs');
var crypto = require('crypto'); // needed for the event md5 below
// note: the dotenv package must be present in your deployment zip
function LambdaEnv(service) {
  return function streambot(event, context) {
    console.log('Start time: %s', (new Date()).toISOString());
    console.log('Event md5: %s', crypto.createHash('md5').update(JSON.stringify(event)).digest('hex'));
    // for more debugging
    // console.log('event = ', JSON.stringify(event, null, 2) );
    // console.log('context = ', JSON.stringify(context, null, 2) );
    var callback = context.done.bind(context);
    if (fs.existsSync('config.env')) {
      require('dotenv').config({path: 'config.env'})
      console.log('Loaded environment from config.env');
    }
    service.call(context, event, callback);
  };
}

Optionally, create a config.env with some/all of the config you might want your functions to consume.

Config - as environment variables, either set directly in the lambda function or in config.env:

## for backup functions
# target bucket
BackupBucket=
# optional prefix inside the bucket
BackupPrefix=
# optional region for the bucket
BackupRegion=

## for replication functions
# optional key/secret for writing to a different AWS account
ReplicaAccessKeyId=
ReplicaSecretAccessKey=
# I haven't looked...
ReplicaTable=
ReplicaRegion=
ReplicaEndpoint=

Build your .zip to upload to your lambda function (I use grunt). Set your lambda function's handler to the "index.lambdaenvBackup" or "index.lambdaenvReplicate" exported function (a scripted version of this step is sketched below).
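In case it helps, a minimal sketch of pushing the zip and switching the handler with the aws-sdk; the function name and zip path are placeholders, and the handler names are the exports added above:

var AWS = require('aws-sdk');
var fs = require('fs');

var lambda = new AWS.Lambda({ region: 'us-west-2' });

lambda.updateFunctionCode({
  FunctionName: 'my-backup',                   // hypothetical function name
  ZipFile: fs.readFileSync('dynamodb-replicator.zip')
}, function (err) {
  if (err) throw err;
  lambda.updateFunctionConfiguration({
    FunctionName: 'my-backup',
    Handler: 'index.lambdaenvBackup'           // or index.lambdaenvReplicate
  }, function (err) {
    if (err) throw err;
    console.log('function code and handler updated');
  });
});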

Update your dynamodb table to enable streams - in the UI, select your table, then choose Manage Stream - you'll want "New and old images". This defines the data that the function picks up in its event JSON when it's triggered.
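If you want to script the stream setup rather than use the UI, a sketch with the aws-sdk (the table name here is just my test table):

var AWS = require('aws-sdk');

var dynamodb = new AWS.DynamoDB({ region: 'us-west-2' });

dynamodb.updateTable({
  TableName: 'dsr-test',
  StreamSpecification: {
    StreamEnabled: true,
    StreamViewType: 'NEW_AND_OLD_IMAGES'       // "New and old images" in the console
  }
}, function (err, data) {
  if (err) throw err;
  console.log('stream ARN: %s', data.TableDescription.LatestStreamArn);
});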

Now, add a trigger to your dynamodb table in the Triggers tab of the ddb UI. Select "new function" or "existing function". Set your batch size to something that makes sense - I haven't determined this yet for my use cases. This is the maximum number of records handed to the lambda function at a time, which trades off runtime per invocation against the number of invocations.
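The same trigger can be created from code; a sketch with the aws-sdk, where the stream ARN and function name are placeholders you'd swap for your own:

var AWS = require('aws-sdk');

var lambda = new AWS.Lambda({ region: 'us-west-2' });

lambda.createEventSourceMapping({
  EventSourceArn: 'arn:aws:dynamodb:us-west-2:123456789012:table/dsr-test/stream/2016-01-01T00:00:00.000', // from the updateTable output
  FunctionName: 'my-backup',
  StartingPosition: 'TRIM_HORIZON',            // start from the oldest available records
  BatchSize: 100                               // max records handed to the function per invocation
}, function (err, data) {
  if (err) throw err;
  console.log('event source mapping %s created', data.UUID);
});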

You'll need to create/set up an IAM role policy so the lambda function can talk to the resources it needs.

For backups, the policy below works for me so far. Insert your s3 bucket names and restrict the ddb resources as appropriate (a scripted way to attach it follows after the policy).

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<YOURBUCKET>"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<YOURBUCKET>/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:DescribeStream",
        "dynamodb:GetRecords",
        "dynamodb:GetShardIterator",
        "dynamodb:ListStreams"
      ],
      "Resource": "*"
    }
  ]
}
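If you'd rather attach that policy from a script than paste it into the console, here's a sketch using the aws-sdk; the role and policy names are placeholders, and policy.json is the document above saved to a file:

var AWS = require('aws-sdk');
var fs = require('fs');

var iam = new AWS.IAM();

iam.putRolePolicy({
  RoleName: 'my-backup-role',                  // the lambda function's execution role
  PolicyName: 'dynamodb-replicator-backup',    // any descriptive inline policy name
  PolicyDocument: fs.readFileSync('policy.json', 'utf8')
}, function (err) {
  if (err) throw err;
  console.log('policy attached');
});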

At this point, updates to your ddb table should trigger the function - you'll see logs in cloudwatch logs reflecting this. If everything works, you should start to see objects show up in s3://bucket/prefix/tablename/ as well.

Related: Lambda functions can now have environment variables set as part of their configuration: https://aws.amazon.com/about-aws/whats-new/2016/11/aws-lambda-supports-environment-variables/

Fallout from this is that streambot is deprecated -- it is no longer needed as a shim for this. In the near future I'll be coming back through this repository to replace streambot entirely. This may also include adding a CloudFormation template to really ease the burden of setting this up.
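For anyone experimenting before that lands, a rough sketch of what the streambot-env / config.env settings could look like as native function environment variables (the function name and values are examples, not something this repo wires up yet):

var AWS = require('aws-sdk');

var lambda = new AWS.Lambda({ region: 'us-west-2' });

lambda.updateFunctionConfiguration({
  FunctionName: 'my-replicator',
  Environment: {
    Variables: {
      ReplicaRegion: 'us-east-1',
      ReplicaTable: 'MyDynoTable',
      BackupBucket: 'dsr-ddb-rep-testing'
    }
  }
}, function (err) {
  if (err) throw err;
  console.log('environment variables set');
});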