awslabs/dynamodb-cross-region-library

Enhancement and Suggestion Needed

freedomofkeima opened this issue · 7 comments

Hi,

After several weeks of using the cross-region replication library, I've noticed several missing features.

  1. Configurable CPU units and memory size in the replication console [important]

I noticed that DynamoDBReplicationConnector requires 256 CPU units and 512 MB of memory. However, an "Out of Memory" issue occurs frequently.

Reason OutOfMemoryError: Container killed due to memory usage

[screenshot: screen shot 2016-01-13 at 6 56 23 pm]

I decided to add more instances to my Auto Scaling group, but the tasks are not distributed properly across them. As a consequence, each failure causes a short outage of 2-4 minutes.

[screenshot: screen shot 2016-01-13 at 6 57 26 pm]

Is there any workaround for this issue (without manually updating the ECS task definition)?

  2. Issues in the CloudFormation template [important]

I've consulted the ECS team about this (aws/amazon-ecs-agent#277 (comment)), and they suggest using ecs-init in the template; see the UserData sketch after this list.

In addition, the SSH key pair provided to CloudFormation does not work for the replication coordinator component; it only works for the connectors.

[screenshot: screen shot 2016-01-13 at 7 13 22 pm]

[screenshot: screen shot 2016-01-13 at 7 13 37 pm]

As a consequence, I cannot access my replication coordinator directly unless I modify the provided template.

  3. Configurable throughput for the KCL and metadata tables via the replication console

We can still change it from the DynamoDB console, but it would be better if we could set it from the replication console at creation time (a CLI workaround is sketched after this list).

  4. Add an option to replicate GSIs and LSIs from the master table via the replication console

  5. Add CloudFormation template support for t-class instances (since they can only be launched inside a VPC)
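For reference on item 2, here is a minimal sketch of what the ecs-init variant of the launch configuration's UserData could look like. This is only a sketch: it assumes the Amazon Linux AMI the template already selects and the cluster name used by the stack; the rest of the original script (including the "aws ecs create-cluster" call) stays as provided.

#!/bin/bash
# Sketch only: let ecs-init install and supervise the ECS agent instead of
# starting amazon/amazon-ecs-agent with a manual "docker run".
yum update -y
yum install -y ecs-init
# Tell the agent which cluster to join before it starts
echo "ECS_CLUSTER=DynamoDBCrossRegionReplication" >> /etc/ecs/ecs.config
service docker start
start ecs    # upstart job installed by ecs-init; it runs the agent container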

Thank you!
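Addendum on item 3: until the replication console exposes this, the throughput of the KCL and metadata tables can also be adjusted per table from the CLI. A sketch, with a placeholder table name and example capacity values:

aws dynamodb update-table \
    --region us-east-1 \
    --table-name <kcl-or-metadata-table> \
    --provisioned-throughput ReadCapacityUnits=50,WriteCapacityUnits=50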

jseed commented

Issue 2 is, in my opinion, a big one: without this fix, ECS metrics don't get reported. Thank you for linking to that issue; it saved me a lot of time getting my autoscaling working.

I have a question about issue 1: is the OutOfMemory error happening because a running task is using too much memory, or when you try to add a task? I ask because we are planning to use this solution for a lot of high-throughput tables in production, and that could be a huge problem.

freedomofkeima commented

As a primary workaround to avoid the out-of-memory error, I added a swap file to each ECS instance. So far, everything is working properly.

jseed commented

Is this something you do manually on each instance, or have you automated it?
I ask because my current solution implements autoscaling, and if you already have this process automated, that would make my life a lot easier!

freedomofkeima commented

Sorry for the late response.

I just edited the template to automate it. In the "EcsInstanceLc" launch configuration, after Docker is initialized, I added some commands to create a swap file. Take a look at the following code. :)

"EcsInstanceLc": {
      "Type": "AWS::AutoScaling::LaunchConfiguration",
      "Properties": {
        "ImageId": {
          "Fn::FindInMap": [
            "AWSRegionArch2AMI",
            {
              "Ref": "AWS::Region"
            },
            {
              "Fn::FindInMap": [
                "AWSInstanceType2Arch",
                {
                  "Ref": "EcsInstanceType"
                },
                "Arch"
              ]
            }
          ]
        },
        "InstanceType": {
          "Ref": "EcsInstanceType"
        },
        "InstanceMonitoring": "false",
        "IamInstanceProfile": {
          "Ref": "EcsInstanceProfile"
        },
        "KeyName": {
          "Fn::If": [
            "KeyNameExists",
            {
              "Ref": "KeyName"
            },
            {
              "Ref": "AWS::NoValue"
            }
          ]
        },
        "SecurityGroups": [
          {
            "Ref": "EcsInstanceSecurityGroup"
          }
        ],
        "UserData": {
          "Fn::Base64": {
            "Fn::Join": [
              "",
              [
                "#!/bin/bash\n",
                "\n",
                "ECS_CLUSTER=\"DynamoDBCrossRegionReplication\"\n",
                "\n",
                "# Create ECS cluster\n",
                "\n",
                "yum update -y \n",
                "\n",
                "aws --region ",
                {
                  "Ref": "AWS::Region"
                },
                " ecs create-cluster --cluster $ECS_CLUSTER\n",
                "\n",
                "yum install docker -y\n",
                "\n",
                "/sbin/service docker start\n",
                "\n",
                "# Start ECS agent\n",
                "# See: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-install.html\n",
                "\n",
                "docker run --name ecs-agent -d \\\n",
                "-v /var/run/docker.sock:/var/run/docker.sock \\\n",
                "-v /var/log/ecs/:/log -p 127.0.0.1:51678:51678 \\\n",
                "-v /var/lib/ecs/data:/data \\\n",
                "-e ECS_LOGFILE=/log/ecs-agent.log \\\n",
                "-e ECS_LOGLEVEL=info \\\n",
                "-e ECS_DATADIR=/data \\\n",
                "-e ECS_CLUSTER=$ECS_CLUSTER \\\n",
                "amazon/amazon-ecs-agent:latest\n",
                "\n",
                "echo \"*/5 * * * * root docker rm \\$(docker ps -a -f status=exited -q)\" >> /etc/crontab\n",
                "\n",
                "# Create swap file\n",
                "\n",
                "fallocate -l 4G /mnt/4GB.swap\n",
                "mkswap /mnt/4GB.swap\n",
                "swapon /mnt/4GB.swap\n",
                "\n"
              ]
            ]
          }
        }
      }
    },
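One small follow-up on the swap commands at the end of that UserData (not part of the original template, just a common extra step): restricting the file's permissions and adding an fstab entry keeps the swap file private and persistent across instance reboots. In the template these would simply be two more strings in the same Fn::Join list.

chmod 600 /mnt/4GB.swap
echo "/mnt/4GB.swap none swap sw 0 0" >> /etc/fstab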

dymaws commented

Hi,

As part of the improvement process for the cross-region replication library, we have completely refactored it and released an updated version. As a result, the previous version, which involved multiple components, has been deprecated. I believe this makes most of the issues described here obsolete, since we have stripped away ECS, Elastic Beanstalk, CloudFormation, and the console UI.

For KCL-related issues and feature requests, please open issues against the relevant repository here.

Thank you!

oli-g commented

Hi @dymaws

I think there are still a lot of people relying on the "deprecated" and "obsolete" Dynamo CRR setup. We're operating and maintaining a CRR ECS cluster (provisioned the "old" way through CloudFormation and Beanstalk) in a production environment, and we're experiencing the "OutOfMemoryError: Container killed due to memory usage" issue.

Should we manually update all the task definitions to set a higher memory hard limit? We're running 10 m3.large instances, and every instance has ~6,000 MB of available memory; nevertheless, tasks are continuously being killed and restarted due to OutOfMemoryError.

So I think this issue should be reopened, and a valid solution should be suggested to all the people still using the "obsolete" CRR cluster in a production environment.

Moreover, would the issue disappear if we decided to move to the "new" solution? It's still not clear what the "new" solution is, however: I think a clear step-by-step guide is missing, and I don't think this README is enough for most users out there.

Thank you!

I also came across the same problem (Reason: OutOfMemoryError: Container killed due to memory usage) and found out that I had set the memory hard limit too low in the task definition. Changing the memory limit to a soft limit, or raising the hard limit, can fix it.
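To make that concrete, here is a minimal sketch of what raising the limits might look like; the family, container, image, and service names are placeholders, and the values are examples rather than recommendations. ECS kills a container with OutOfMemoryError when it exceeds the hard limit ("memory"), while the soft limit ("memoryReservation") is only used for placement, so the container can burst above it.

# Register a new revision of the connector task definition with higher limits,
# then point the ECS service at the new revision.
aws ecs register-task-definition \
    --family DynamoDBReplicationConnector \
    --container-definitions '[{
        "name": "replication-connector",
        "image": "<your-connector-image>",
        "essential": true,
        "memory": 1024,
        "memoryReservation": 512
    }]'

aws ecs update-service \
    --cluster DynamoDBCrossRegionReplication \
    --service <replication-connector-service> \
    --task-definition DynamoDBReplicationConnector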