- Overview
- Prerequisites
- Deployment Steps
- Deployment Validation
- Running the Guidance
- Next Steps
- Cleanup
- Notices (optional)
This Guidance demonstrates how to efficiently migrate data from self-managed Apache Cassandra clusters to Amazon Keyspaces in near real time using CQLReplicator, an open-source utility built by AWS Solutions Architects.
The included sample code features CloudFormation templates that significantly reduce the complexity of setting up key components such as the VPC, subnets, security groups, IAM roles, and the Cassandra cluster, cutting down manual configuration effort. These templates, along with the additional steps below, let you load data into an Apache Cassandra cluster and migrate that data using CQLReplicator.
Architecture diagram:
You are responsible for the cost of the AWS services used while running this Guidance.
As of 09/11/2024, the cost of running this Guidance with the default settings in US East (N. Virginia) is approximately $434.57 per month. This covers creating the resources (a three-node Cassandra cluster, a Cassandra client EC2 instance, Amazon Keyspaces, and an AWS Glue job) with the Guidance CloudFormation templates and migrating 64K records from a Cassandra table to Amazon Keyspaces using AWS Glue.
The following table provides a sample cost breakdown for deploying this Guidance with the default parameters in the US East (N. Virginia) Region for one month.
AWS service | Dimensions | Monthly cost [USD] |
---|---|---|
AWS Glue | Number of DPUs for Apache Spark job (10), number of DPUs for Python shell job (0.0625) | $4.40 |
Amazon Elastic Compute Cloud (Amazon EC2) - Cassandra client instance | Tenancy (Shared Instances), operating system (Ubuntu Pro), workload (consistent, number of instances: 1), EC2 instance (t2.medium), pricing strategy (3-yr No Upfront), monitoring (enabled), EBS storage (50 GB), DT Inbound (0 TB per month), DT Outbound (0 TB per month), DT Intra-Region (100 GB per month) | $24.38 |
Amazon Elastic Compute Cloud (Amazon EC2) - Cassandra nodes | Tenancy (Shared Instances), operating system (Ubuntu Pro), workload (consistent, number of instances: 3), EC2 instance (t2.2xlarge), pricing strategy (3-yr No Upfront), monitoring (enabled), EBS storage (100 GB), DT Inbound (0 TB per month), DT Outbound (0 TB per month), DT Intra-Region (100 GB per month) | $403.51 |
Amazon Keyspaces | LOCAL_ONE reads (0), LOCAL_QUORUM reads (1), PITR storage (enabled), storage (1 GB), number of writes (1,000,000 per month), number of reads (1,000,000 per month), number of TTL delete operations (0 per month) | $2.28 |
We recommend creating a Budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this Guidance.
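For example, here is a minimal, hypothetical AWS CLI sketch of such a monthly cost budget; the budget name and the $450 limit are illustrative placeholders, not values defined by this Guidance:

```bash
# Hypothetical monthly cost budget; adjust the name and amount to your needs.
aws budgets create-budget \
  --account-id <AWS Account ID> \
  --budget '{
    "BudgetName": "cql-replicator-guidance",
    "BudgetLimit": {"Amount": "450", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }'
```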
- The AWS CLI installed.
- Permissions to deploy CloudFormation templates and create resources (AWS Glue, Amazon Keyspaces, EC2, VPC, subnets, Amazon S3, security groups, IAM roles and policies).
- Git installed, to clone the repository.
- A macOS, Linux, or Amazon Linux environment to run and deploy this Guidance.
This deployment requires that you have access to the following AWS services:
- Amazon Simple Storage Service (Amazon S3)
- AWS Glue
- Amazon Keyspaces
- Amazon Elastic Compute Cloud (Amazon EC2)
- Amazon Virtual Private Cloud (Amazon VPC)
- AWS Identity and Access Management (IAM)
- AWS CloudFormation
- Amazon CloudWatch
This Guidance can be deployed in any AWS Region where Amazon Keyspaces is supported. You can find the list of Amazon Keyspaces service endpoints for each AWS Region [here](https://docs.aws.amazon.com/general/latest/gr/keyspaces.html).
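As a convenience, the AWS CLI can list those Regions from SSM public parameters. This is a sketch only; the service key `cassandra` is an assumption based on the Amazon Keyspaces endpoint prefix:

```bash
# List Regions where Amazon Keyspaces (service key assumed to be "cassandra")
# is available, using SSM public parameters.
aws ssm get-parameters-by-path \
  --path /aws/service/global-infrastructure/services/cassandra/regions \
  --query 'Parameters[].Value' --output text
```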
These deployment instructions are optimized for macOS or Amazon Linux 2023. Deployment on another OS may require additional steps.
- Clone the repo using the command:
git clone https://github.com/aws-solutions-library-samples/guidance-for-near-real-time-data-migration-from-apache-cassandra-to-amazon-keyspaces
- cd into the repo's templates folder:
cd guidance-for-near-real-time-data-migration-from-apache-cassandra-to-amazon-keyspaces/deployment/templates
- Configure the AWS CLI environment by setting the values below. Make sure to replace the placeholders with your environment-specific values:
export AWS_REGION=<AWS Region>
export AWS_ACCOUNT_ID=<AWS Account ID>
export AWS_ACCESS_KEY_ID=<AWS ACCESS KEY>
export AWS_SECRET_ACCESS_KEY=<AWS SECRET ACCESS KEY>
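Optionally, verify that the credentials resolve to the expected account before proceeding:

```bash
# Sanity check: should print the account ID you exported above.
aws sts get-caller-identity --query Account --output text
```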
- Run the command below to create an S3 bucket:
aws s3api create-bucket --bucket cql-replicator-$AWS_ACCOUNT_ID-$AWS_REGION
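Note: in Regions other than us-east-1, `s3api create-bucket` requires an explicit location constraint, so the variant below applies there:

```bash
# Required outside us-east-1: pass the Region as a LocationConstraint.
aws s3api create-bucket --bucket cql-replicator-$AWS_ACCOUNT_ID-$AWS_REGION \
  --create-bucket-configuration LocationConstraint=$AWS_REGION
```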
- Run the command below to create an EC2 key pair:
aws ec2 create-key-pair --key-name my-cass-kp --query 'KeyMaterial' --output text > my-cass-kp.pem
Note: Save the output file `my-cass-kp.pem`; its contents are used later to connect to the Cassandra EC2 instances from the Cassandra client EC2 instance.
- Now run the command below to deploy the CloudFormation template that creates a new VPC, subnets, security groups, the Cassandra client EC2 instance, an Amazon Keyspaces keyspace and table, and IAM roles with policies. You can check progress from the CloudFormation console.
aws cloudformation deploy --template-file cfn-vpc-ks.yml --stack-name cfn-vpc-ks-stack --parameter-overrides KeyName=my-cass-kp --tags purpose=vpc-ks-iamroles-creation --s3-bucket cql-replicator-$AWS_ACCOUNT_ID-$AWS_REGION --capabilities CAPABILITY_NAMED_IAM
- Once the CloudFormation stack `cfn-vpc-ks-stack` is finished, run the command below to capture the stack's output into a file.
aws cloudformation describe-stacks --stack-name cfn-vpc-ks-stack --query "Stacks[0].Outputs[*].[OutputKey,OutputValue]" --output text > stack_resources_output
- From the output file `stack_resources_output`, pick the values of `CassandraVPCId`, `PrivateSubnetOne`, `PrivateSubnetTwo`, `PrivateSubnetThree`, and `CassandraClientInstanceSecurityGroupID`.
- Now pass the values from step 8 to the CloudFormation deploy parameters as follows, and deploy the template to create the Cassandra nodes: `CassandraVPCId` to `VpcId`, `PrivateSubnetOne` to `Subnet1`, `PrivateSubnetTwo` to `Subnet2`, `PrivateSubnetThree` to `Subnet3`, and `CassandraClientInstanceSecurityGroupID` to both `SourceSecurityGroup` and `CassandraClientSecurityGroup`. (A scripted alternative that extracts these values automatically is sketched after the command.)
aws cloudformation deploy --template-file cfn_cassandra_cluster_creation.yml --stack-name cass-cluster-stack --parameter-overrides KeyName=my-cass-kp VpcId=<value of CassandraVPCId> Subnet1=<value of PrivateSubnetOne> Subnet2=<value of PrivateSubnetTwo> Subnet3=<value of PrivateSubnetThree> SourceSecurityGroup=<value of CassandraClientInstanceSecurityGroupID> CassandraClientSecurityGroup=<value of CassandraClientInstanceSecurityGroupID> --tags purpose=cass-nodes-creation --capabilities CAPABILITY_NAMED_IAM
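Alternatively, here is a minimal sketch that pulls those five values straight out of `stack_resources_output` (assuming the tab-separated OutputKey/OutputValue format produced by the describe-stacks command above) and feeds them to the deploy command:

```bash
# Extract a named output value from the tab-separated describe-stacks dump.
get_output() { awk -v k="$1" '$1 == k {print $2}' stack_resources_output; }

VPC_ID=$(get_output CassandraVPCId)
SUBNET1=$(get_output PrivateSubnetOne)
SUBNET2=$(get_output PrivateSubnetTwo)
SUBNET3=$(get_output PrivateSubnetThree)
CLIENT_SG=$(get_output CassandraClientInstanceSecurityGroupID)

aws cloudformation deploy --template-file cfn_cassandra_cluster_creation.yml \
  --stack-name cass-cluster-stack \
  --parameter-overrides KeyName=my-cass-kp VpcId=$VPC_ID \
    Subnet1=$SUBNET1 Subnet2=$SUBNET2 Subnet3=$SUBNET3 \
    SourceSecurityGroup=$CLIENT_SG CassandraClientSecurityGroup=$CLIENT_SG \
  --tags purpose=cass-nodes-creation --capabilities CAPABILITY_NAMED_IAM
```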
- Once the CloudFormation stack `cass-cluster-stack` is finished, run the command below to capture the stack's output into a file.
aws cloudformation describe-stacks --stack-name cass-cluster-stack --query "Stacks[0].Outputs[*].[OutputKey,OutputValue]" --output text > stack_resources_cassandra_output
Open the CloudFormation console and verify the status of the stacks named `cfn-vpc-ks-stack` and `cass-cluster-stack`. If the deployments are successful, you should see a VPC with subnets, an Amazon EC2 Cassandra client instance, three Amazon EC2 Cassandra nodes, and an Amazon Keyspaces keyspace and table.
Once the CloudFormation stacks are deployed, follow the steps below to configure and test the Guidance.
- Copy the `my-cass-kp.pem` file from step 5 of the deployment to the Cassandra client EC2 instance `cqlrepl-ks-cass-CassandraClientInstance`. Replace `<ip_address_ec2>` with the IP address of the Cassandra client EC2 instance `cqlrepl-ks-cass-CassandraClientInstance`.
chmod 400 my-cass-kp.pem
scp -i "my-cass-kp.pem" my-cass-kp.pem ubuntu@ec2-<ip_address_ec2>.compute-1.amazonaws.com:~/.
- Connect to the Cassandra client EC2 instance `cqlrepl-ks-cass-CassandraClientInstance` using EC2 Instance Connect or SSH, and finish configuring the Cassandra cluster, starting with Cassandra node one.
SSH to `CassandraNode-One` using the EC2 key pair `my-cass-kp.pem`. Make sure to replace `<IP Address of CassandraNode-one>` with the IP address of `CassandraNode-One`.
ssh -i "my-cass-kp.pem" ubuntu@<IP Address of CassandraNode-one>
cd /home/ubuntu/apache-cassandra-3.11.2
bin/cassandra
bin/nodetool status
- Stay on the Cassandra node one command line and check CQLSH connectivity on Cassandra node one:
cd /home/ubuntu/apache-cassandra-3.11.2
bin/cqlsh `hostname -i` -u cassandra -p cassandra
select * from system.local;
- Stay on the command line and get the IP address of CassandraNode-One with the command below, to use it on the other two nodes for the Cassandra cluster setup:
hostname -i
Now exit the `CassandraNode-One` terminal and go back to the EC2 instance `cqlrepl-ks-cass-CassandraClientInstance` terminal.
- Now configure the second Cassandra node by SSHing to `CassandraNode-Two` using the EC2 key pair `my-cass-kp.pem`:
ssh -i "my-cass-kp.pem" ubuntu@<IP Address of CassandraNode-Two>
- Edit `cassandra.yaml`, update the value of the `seeds` property with the IP address of `CassandraNode-One` from step 8, and save the file. For reference, see the example screenshot with the `seeds` property.
cd /home/ubuntu/apache-cassandra-3.11.2
vi conf/cassandra.yaml
- Start the Cassandra service:
cd /home/ubuntu/apache-cassandra-3.11.2
bin/cassandra
bin/nodetool status
Now exit the `CassandraNode-Two` terminal and go back to the EC2 instance `cqlrepl-ks-cass-CassandraClientInstance` terminal.
- Now configure the third Cassandra node by SSHing to `CassandraNode-Three` using the EC2 key pair `my-cass-kp.pem`:
ssh -i "my-cass-kp.pem" ubuntu@<IP Address of CassandraNode-Three>
- Edit `cassandra.yaml`, update the value of the `seeds` property with the IP address of `CassandraNode-One` from step 8, and save the file. For reference, see the example screenshot with the `seeds` property.
cd /home/ubuntu/apache-cassandra-3.11.2
vi conf/cassandra.yaml
- Start the Cassandra service:
cd /home/ubuntu/apache-cassandra-3.11.2
bin/cassandra
bin/nodetool status
Note: nodetool status should now show a three-node Cassandra cluster, similar to the illustrative output below.
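This output is illustrative only, not literal; addresses, loads, tokens, and host IDs will differ in your cluster, and the datacenter name assumed here matches the `Datacenter1` used in the keyspace definition below. The key signal is the `UN` (Up/Normal) state on all three nodes:

```
Datacenter: Datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID   Rack
UN  10.0.1.11  102.5 KiB  256     100.0%            1f8c...   rack1
UN  10.0.2.12  98.7 KiB   256     100.0%            5a3b...   rack1
UN  10.0.3.13  101.3 KiB  256     100.0%            9e7f...   rack1
```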
- Stay on the `CassandraNode-Three` command line and create a keyspace and table in the Cassandra cluster.
cd /home/ubuntu/apache-cassandra-3.11.2
bin/cqlsh `hostname -i` -u cassandra -p cassandra
CREATE KEYSPACE aws WITH replication = {'class': 'NetworkTopologyStrategy', 'Datacenter1': '3'} AND durable_writes = true;
CREATE TABLE aws.orders (
order_id uuid,
order_date timestamp,
product_id uuid,
quantity int,
user_id uuid,
PRIMARY KEY (order_id, order_date)
) WITH CLUSTERING ORDER BY (order_date ASC);
- Now create the cassandra-stress YAML file to generate data and load it into the `aws.orders` table:
vi ecommerce_stress.yaml
keyspace: aws
table: orders
columnspec:
  - name: order_id
    size: fixed(36)                 # UUID size
    population: uniform(1..1000000) # 1 million unique order IDs
  - name: product_id
    size: fixed(36)
    population: uniform(1..100000)  # Assuming 100,000 unique products
  - name: user_id
    size: fixed(36)
    population: uniform(1..10000)   # Assuming 10,000 unique users
  - name: order_date
    cluster: uniform(1..1000)       # Random timestamps
  - name: quantity
    size: fixed(4)                  # Integer size
    population: uniform(1..10)      # Quantities from 1 to 10
insert:
  partitions: fixed(1)              # One partition per batch
  batchtype: UNLOGGED
  select: fixed(1)/1000             # To avoid reading too much
queries:
  simple1:
    cql: select * from orders where order_id = ?
    fields: samerow                 # Use data from the same row
/home/ubuntu/apache-cassandra-3.11.2/tools/bin/cassandra-stress user profile=ecommerce_stress.yaml n=50000 ops\(insert=1\) -rate threads=100 -node `hostname -i` -mode native cql3 user=cassandra password=cassandra
- Now validate the loaded data in the `aws.orders` table in the Cassandra database.
Note: The row count in the screenshot below is approximate; your count may differ.
Apache Cassandra CQLSH output screenshot:
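A quick way to reproduce that validation from a node's shell is a one-shot count through cqlsh. Note that `COUNT(*)` scans the whole table, which is fine at this data volume but should be avoided on large production tables:

```bash
cd /home/ubuntu/apache-cassandra-3.11.2
bin/cqlsh `hostname -i` -u cassandra -p cassandra -e "SELECT count(*) FROM aws.orders;"
```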
- Now open CloudShell from the AWS Console and download the CQLReplicator repository:
git clone https://github.com/aws-samples/cql-replicator.git
- Replace the file `cql-replicator/glue/conf/CassandraConnector.conf` with the file `guidance-for-near-real-time-data-migration-from-apache-cassandra-to-amazon-keyspaces/CassandraConnector.conf`.
- Modify the newly replaced `CassandraConnector.conf` file in the directory `cql-replicator/glue/conf/` with the following change: replace `<ip_address_cassandra_node1>` with the `PrivateIpInstanceOne` value from the `stack_resources_cassandra_output` file.
- Now initialize the CQLReplicator environment. The following command initializes the CQLReplicator environment, which involves copying JAR artifacts and creating a Glue connector, an S3 bucket, a Glue job, a migration keyspace, and a ledger table.
Note: If you are running this command from CloudShell and encounter the error `bc requires but it's not installed. Aborting. You could try to run: sudo yum install bc -y`, then run the command `sudo yum install bc -y` to resolve the error.
- `--sg`: replace `<CassandraSecurityGroupId>` with the `CassandraSecurityGroupId` value from the `stack_resources_cassandra_output` file
- `--subnet`: replace `<PrivateSubnetOne>` with the `PrivateSubnetOne` value from the `stack_resources_cassandra_output` file
- `--az`: replace `<PrivateSubnetOneAZ>` with the `PrivateSubnetOneAZ` value from the `stack_resources_vpc_output` file
- `--region`: replace `<aws-region-cassandra-cluster>` with the AWS Region of the Cassandra cluster
- `--glue-iam-role`: replace `<GlueRolename>` with the `GlueRolename` value from the `stack_resources_vpc_output` file
- `--landing-zone`: replace `<s3_bucket_name>` with the S3 bucket name from step 4 of the deployment
cd cql-replicator/glue/bin
./cqlreplicator --state init --sg '"<CassandraSecurityGroupId>"' --subnet "<PrivateSubnetOne>" --az <PrivateSubnetOneAZ> --region <aws-region-cassandra-cluster> --glue-iam-role <GlueRolename> --landing-zone s3://<s3_bucket_name>
Output of a successful initialization looks like the screenshot below.
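You can also confirm from the CLI that the init step registered Glue jobs. The exact job names are generated by CQLReplicator, so simply look for CQLReplicator-related entries in the list:

```bash
aws glue list-jobs --region <aws-region-cassandra-cluster> --query 'JobNames'
```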
- Run the CQLReplicator to start the migration
cd cql-replicator/glue/bin
- `--landing-zone`: replace `<s3_bucket_name>` with the S3 bucket name from step 4 of the deployment
- `--region`: replace `<AWS_REGION>` with the AWS Region used in step 13
./cqlreplicator --state run --tiles 2 --landing-zone s3://<s3_bucket_name> --region <AWS_REGION> --writetime-column quantity --src-keyspace aws --src-table orders --trg-keyspace aws --trg-table orders
Output of the command after successfully starting one Discovery Glue job and two Replicator Glue jobs:
One Discovery job in the AWS Console:
Screenshot of one Discovery job and two Replication jobs running:
- Now check the migration stats from the CloudShell command line.
- `--landing-zone`: replace `<s3_bucket_name>` with the S3 bucket name from step 4 of the deployment
- `--region`: replace `<AWS_REGION>` with the AWS Region used in step 13
./cqlreplicator --state stats --tiles 2 --landing-zone s3://<s3_bucket_name> --region <AWS_REGION> --src-keyspace aws --src-table orders --trg-keyspace aws --trg-table orders --replication-stats-enabled
Initial replication stats will look like the screenshot below.
Replication stats after the full load:
- Now insert a record into the Cassandra database and test the replication from Cassandra to Keyspaces. Connect to the Cassandra client EC2 instance `cqlrepl-ks-cass-CassandraClientInstance` using EC2 Instance Connect, SSH to the `CassandraNode-One` EC2 instance, and connect to `cqlsh`:
ssh -i "my-cass-kp.pem" ubuntu@<IP Address of CassandraNode-one>
cd /home/ubuntu/apache-cassandra-3.11.2
bin/cqlsh `hostname -i` -u cassandra -p cassandra
insert into aws.orders (order_id,order_date,product_id,quantity,user_id) VALUES (50554d6e-29bb-11e5-b345-feff819cdc9f, toTimeStamp(now()),now(),10,now());
select * from aws.orders where order_id=50554d6e-29bb-11e5-b345-feff819cdc9f;
Screenshot of the inserted row in the Cassandra database:
Validate the replication by navigating to the CQL editor in the AWS Console under the Amazon Keyspaces service.
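In the CQL editor, you can run the same lookup you ran in cqlsh to confirm the row arrived:

```sql
select * from aws.orders where order_id=50554d6e-29bb-11e5-b345-feff819cdc9f;
```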
Screenshot of the inserted row in the Keyspaces database:
- Now update the record in the Cassandra database and test the replication from Cassandra to Keyspaces. Connect to the Cassandra client EC2 instance `cqlrepl-ks-cass-CassandraClientInstance` using EC2 Instance Connect, SSH to the `CassandraNode-One` EC2 instance, and connect to `cqlsh`:
ssh -i "my-cass-kp.pem" ubuntu@<IP Address of CassandraNode-one>
cd /home/ubuntu/apache-cassandra-3.11.2
bin/cqlsh `hostname -i` -u cassandra -p cassandra
update aws.orders set quantity=300 where order_id=00000000-0001-b200-0000-00000001b200 and order_date='1971-08-23 23:15:33.697';
select * from aws.orders where order_id=00000000-0001-b200-0000-00000001b200 and order_date='1971-08-23 23:15:33.697';
Screenshot of the updated row in the Cassandra database:
Validate the replication by navigating to the CQL editor in the AWS Console under the Amazon Keyspaces service.
Screenshot of the updated row in the Keyspaces database:
- Now delete the record in the Cassandra database and test the replication from Cassandra to Keyspaces. Connect to the Cassandra client EC2 instance `cqlrepl-ks-cass-CassandraClientInstance` using EC2 Instance Connect, SSH to the `CassandraNode-One` EC2 instance, and connect to `cqlsh`:
ssh -i "my-cass-kp.pem" ubuntu@<IP Address of CassandraNode-one>
cd /home/ubuntu/apache-cassandra-3.11.2
bin/cqlsh `hostname -i` -u cassandra -p cassandra
delete from aws.orders where order_id=50554d6e-29bb-11e5-b345-feff819cdc9f;
select * from aws.orders where order_id=50554d6e-29bb-11e5-b345-feff819cdc9f;
Screenshot of the deleted row in the Cassandra database:
Validate the replication by navigating to the CQL editor in the AWS Console under the Amazon Keyspaces service.
Before the row is deleted from Amazon Keyspaces:
After the row is deleted from Amazon Keyspaces:
- Now check the final stats from CloudShell. These stats cover the full load plus the inserted, updated, and deleted rows, and show how CQLReplicator migrated data from the Cassandra database to the Keyspaces database in near real time.
- `--landing-zone`: replace `<s3_bucket_name>` with the S3 bucket name from step 4 of the deployment
- `--region`: replace `<AWS_REGION>` with the AWS Region used in step 13
./cqlreplicator --state stats --tiles 2 --landing-zone s3://<s3_bucket_name> --region <AWS_REGION> --src-keyspace aws --src-table orders --trg-keyspace aws --trg-table orders --replication-stats-enabled
Screenshot of final stats
Having explored how to efficiently migrate data from Apache Cassandra to Amazon Keyspaces using CQLReplicator in near real time, you can implement a similar setup for your applications during a migration from Apache Cassandra to Amazon Keyspaces.
To delete the resources created as part of this Guidance, complete the steps below.
- Stop the CQLReplicator job
cd cql-replicator/glue/bin
- `--landing-zone`: replace `<s3_bucket_name>` with the S3 bucket name from step 4 of the deployment
- `--region`: replace `<AWS_REGION>` with the AWS Region used in step 13
./cqlreplicator --state request-stop --tiles 2 --landing-zone s3://<s3_bucket_name> --region <AWS_REGION> --src-keyspace aws --src-table orders --trg-keyspace aws --trg-table orders
A successfully stopped CQLReplicator job marks the Glue jobs as succeeded, as in the screenshot below.
- Clean up the CQLReplicator job. This deletes the S3 bucket and removes the AWS Glue CQLReplicator streaming job.
cd cql-replicator/glue/bin
- `--landing-zone`: replace `<s3_bucket_name>` with the S3 bucket name from step 4 of the deployment
- `--region`: replace `<AWS_REGION>` with the AWS Region used in step 13
./cqlreplicator --state cleanup --landing-zone s3://<s3_bucket_name> --region <AWS_REGION>
- Delete all remaining resources using the command below. Check the deletion status of the CloudFormation stacks (`cass-cluster-stack` and `cfn-vpc-ks-stack`) from the CloudFormation console after executing the script. If the VPC created as part of `cfn-vpc-ks-stack` is not deleted by this command, run the command in the optional step below.
cd guidance-for-near-real-time-data-migration-from-apache-cassandra-to-amazon-keyspaces
sh delete_stack.sh
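Optionally, wait on each deletion from the CLI; the wait command exits non-zero if a stack fails to delete:

```bash
aws cloudformation wait stack-delete-complete --stack-name cass-cluster-stack
aws cloudformation wait stack-delete-complete --stack-name cfn-vpc-ks-stack
```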
- This step is optional and is needed only when the VPC and subnets created as part of `cfn-vpc-ks-stack` were not deleted during CloudFormation stack deletion.
cd guidance-for-near-real-time-data-migration-from-apache-cassandra-to-amazon-keyspaces
sh delete_vpc.sh
- Make sure to check for any leftover resources and delete them manually to avoid accidental charges.
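One hedged way to spot leftovers is to query the Resource Groups Tagging API for the tags this Guidance applied. Note that this only finds taggable resources, so a manual console check is still worthwhile:

```bash
# List ARNs of resources still carrying this Guidance's purpose tags.
aws resourcegroupstaggingapi get-resources \
  --tag-filters Key=purpose,Values=vpc-ks-iamroles-creation,cass-nodes-creation \
  --query 'ResourceTagMappingList[].ResourceARN'
```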
Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.