
Running self-healing IBM MQ in two Zones in a pre-existing VPC and Subnets


MQ on AWS: PoC of high availability using EFS

Amazon recently declared its Elastic File System (EFS) as ready for production. This enables a shared, networked file system, which (importantly) is replicated between multiple physical data centers (availability zones). On paper, this makes EFS a good candidate for running MQ in a highly available way. In this blog entry, I'll take you through our proof of concept (PoC) of running a single IBM MQ queue manager which can be automatically moved between availability zones in the case of a failure.


An EFS file system is scoped to a particular AWS region. You can create "mount targets" for VPC subnets in different availability zones within that region. Once the mount target has been created, EC2 instances in those subnets can successfully mount the file system using NFS v4. You can read more about EFS in the AWS EFS documentation.
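As a rough illustration of the mount step, here is a minimal sketch that builds the NFS mount command for an EFS file system. The helper names and the file system ID are hypothetical, and the DNS name format and mount options are the ones AWS generally documents for EFS, not necessarily the ones our scripts use.

```shell
# Sketch only: helper names and the file system ID are illustrative.

# Print the region-level DNS name of an EFS file system
# (assumed format: <file-system-id>.efs.<region>.amazonaws.com).
efs_dns_name() {
  local fs_id="$1" region="$2"
  printf '%s.efs.%s.amazonaws.com\n' "$fs_id" "$region"
}

# Print (but do not run) the NFS v4.1 mount command for that file system.
efs_mount_command() {
  local fs_id="$1" region="$2" mount_point="$3"
  local opts="nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2"
  printf 'mount -t nfs4 -o %s %s:/ %s\n' \
    "$opts" "$(efs_dns_name "$fs_id" "$region")" "$mount_point"
}

# Example (prints the command; run the printed command as root to mount):
#   efs_mount_command fs-12345678 eu-west-1 /var/mqm
```

The mount only succeeds from an instance in a subnet that has a mount target, with security groups that allow NFS traffic (TCP port 2049).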

In this PoC, we used CloudFormation to run a single EC2 instance running MQ, as part of an Auto Scaling Group of one server. This ensures that if the MQ instance is determined to be unhealthy, then AWS will destroy the instance and replace it with a new one, connected back to the same file system. You can span multiple availability zones with an Auto Scaling Group. The Auto Scaling Group has a policy applied to ensure that there are only ever 0 or 1 instances available: during an update to the CloudFormation stack, the existing instance is always terminated before starting a new one.

When the MQ EC2 instance first boots, it mounts the file system as /var/mqm, and adds an entry to /etc/fstab so that the file system is mounted again if the instance is rebooted. If there's already data for a queue manager in the file system, then it sets up a systemd service to run the queue manager, and creates a dependency on the mount point being available. This systemd service also ensures that the queue manager is restarted after a reboot.
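To make the boot-time setup more concrete, here is a hedged sketch of what the /etc/fstab entry and the systemd unit could look like. The function names are hypothetical, and the unit contents are an assumption based on a default MQ installation under /opt/mqm; the unit actually created by the scripts may differ. The key idea is RequiresMountsFor, which makes the service depend on the mount point.

```shell
# Sketch only: generates text, does not modify the system.

# Print an /etc/fstab entry for an NFS v4.1 mount (_netdev delays the
# mount until the network is up).
fstab_entry() {
  local device="$1" mount_point="$2"
  printf '%s %s nfs4 nfsvers=4.1,hard,timeo=600,retrans=2,_netdev 0 0\n' \
    "$device" "$mount_point"
}

# Print a systemd unit that starts a queue manager once its data
# directory is mounted. Paths assume MQ is installed in /opt/mqm.
mq_unit_file() {
  local qm="$1" mount_point="$2"
  cat <<EOF
[Unit]
Description=IBM MQ queue manager ${qm}
RequiresMountsFor=${mount_point}

[Service]
Type=forking
User=mqm
ExecStart=/opt/mqm/bin/strmqm ${qm}
ExecStop=/opt/mqm/bin/endmqm -w ${qm}

[Install]
WantedBy=multi-user.target
EOF
}

# Example (as root), using a hypothetical unit name:
#   mq_unit_file mqdev /var/mqm > /etc/systemd/system/mq-mqdev.service
#   systemctl daemon-reload && systemctl enable mq-mqdev.service
```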

We also used an Elastic Load Balancer (ELB) to provide a single TCP/IP endpoint for MQ client applications to connect to. In some ways, an ELB is overkill here - alternatives include using an Elastic IP address which can be re-bound to a different EC2 instance, or using Route 53 to handle it via DNS. With the ELB, we can also add a health check, to ensure that MQ is listening on port 1414, and mark the instance as unhealthy if not. In addition, we added a health check to the instance which periodically runs dspmq to check that the queue manager is running. If it is ever found to be down, then the AWS command line interface is used to mark the instance as unhealthy. Any unhealthy instances will be terminated and replaced by the Auto Scaling Group.
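The on-instance health check described above can be sketched roughly as follows. This is an illustration rather than the script shipped with the PoC: the function names are hypothetical, and it assumes that dspmq and the AWS CLI are on the PATH and that the instance's IAM role allows autoscaling:SetInstanceHealth.

```shell
# Sketch only: function names are illustrative.

# Read dspmq output on stdin; succeed if the named queue manager
# is reported with STATUS(Running).
qm_is_running() {
  local qm="$1"
  grep -Eq "^QMNAME\(${qm}\)[[:space:]]+STATUS\(Running\)"
}

# Run one health-check pass: if the queue manager is not running,
# tell the Auto Scaling Group that this instance is unhealthy.
check_once() {
  local qm="$1" instance_id="$2" region="$3"
  if ! dspmq -m "$qm" | qm_is_running "$qm"; then
    aws autoscaling set-instance-health \
      --instance-id "$instance_id" \
      --health-status Unhealthy \
      --region "$region"
  fi
}

# Example loop (the instance ID comes from the EC2 instance metadata service):
#   id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
#   while true; do check_once mqdev "$id" eu-west-1; sleep 30; done
```

Separating the parsing from the AWS call keeps the check easy to test against captured dspmq output.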

Reproducing our PoC

If you'd like to try this out for yourself, then you can use the following instructions. The PoC requires Packer to be installed on your local laptop or workstation.

  1. Run packer build packer-mq-aws.json to build an AMI in the eu-west-1 (Ireland) region. If you'd like to use a different region, you can edit the JSON file, making sure to also replace the source_ami with the equivalent RHEL 7.2 AMI in your chosen region. Note that, at the time of writing, EFS is not available in all regions. Before running the command, you have to set the VPC and subnet in which the image will be built. These parameters are in the packer-mq-aws.json file: "vpc_id": "vpc-XXXXXXX", "subnet_id": "subnet-XXXXXXX".
  2. Create a stack using the CloudFormation template ibm-mq-efs.yaml. The template takes parameters for the VPC, subnets, etc., whose values you need to know in advance.
$ aws cloudformation create-stack --stack-name mqdev-efs \
        --template-body file://./ibm-mq-efs.yaml \
        --capabilities CAPABILITY_IAM --region eu-west-1 \
        --parameters ParameterKey=KeyName,ParameterValue=${MY_KEY} \
        ParameterKey=VPC,ParameterValue=${MY_VPC} \
        ParameterKey=Subnet1,ParameterValue=${MY_SUBNET1} \
        ParameterKey=Subnet2,ParameterValue=${MY_SUBNET2} \
        ParameterKey=QueueManagerName,ParameterValue=mqdev \
        ParameterKey=AMI,ParameterValue=${MY_AMI} \
        ParameterKey=AvailabilityZone1,ParameterValue=eu-west-1a

The CloudFormation template creates several resources, including the Auto Scaling Group, a Launch Configuration, and an IAM role that enables the EC2 instances to report their health, but it does not create the necessary network. You therefore need a pre-existing VPC and subnets.
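Because the network must already exist, it can be worth checking the subnets before creating the stack. The following is a minimal sketch, not part of the PoC: the function names are hypothetical, and it assumes that both subnets should belong to the given VPC and sit in different availability zones.

```shell
# Sketch only: function names are illustrative.

# Read "VpcId<TAB>AvailabilityZone" lines on stdin. Succeed only if there
# are at least two subnets, all in the expected VPC, with no zone repeated.
subnets_ok() {
  local vpc="$1"
  awk -v vpc="$vpc" '
    $1 != vpc  { bad = 1 }   # subnet belongs to a different VPC
    seen[$2]++ { bad = 1 }   # availability zone already used
    END { exit (bad || NR < 2) ? 1 : 0 }
  '
}

# Query the subnets with the AWS CLI and validate them.
check_subnets() {
  local vpc="$1" region="$2"
  shift 2
  aws ec2 describe-subnets --subnet-ids "$@" --region "$region" \
    --query 'Subnets[].[VpcId,AvailabilityZone]' --output text |
    subnets_ok "$vpc"
}

# Example:
#   check_subnets "$MY_VPC" eu-west-1 "$MY_SUBNET1" "$MY_SUBNET2" \
#     && echo "subnets look usable"
```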

If you inspect the created resources, you will see an Auto Scaling Group with a single instance. You have several options to test out the fail-over:

  1. SSH into the instance (as user ec2-user) and stop/kill the MQ queue manager. This will cause the local health-checking script to invoke the AWS CLI to mark the instance as unhealthy.
  2. Terminate the instance entirely.
  3. Mark the instance as unhealthy, either in the web console or on the command line.

Once the instance is marked as unhealthy, the AWS Auto Scaling Group will create a new one. Note that as the instance is in an otherwise-healthy availability zone, the instance may be re-created in the same zone. If you keep trying though, eventually, AWS should randomly assign the instance to the secondary zone.
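For the command-line variants of these tests, a sketch along the following lines may help. The wrapper and function names are hypothetical and the instance ID is a placeholder; the underlying AWS CLI commands are the standard ones for setting instance health, terminating an instance, and listing Auto Scaling instances. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Sketch only: wrapper and function names are illustrative.

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Mark an instance as unhealthy so the Auto Scaling Group replaces it.
mark_unhealthy() {
  run aws autoscaling set-instance-health --instance-id "$1" \
    --health-status Unhealthy --region "${2:-eu-west-1}"
}

# Terminate an instance outright.
terminate_instance() {
  run aws ec2 terminate-instances --instance-ids "$1" \
    --region "${2:-eu-west-1}"
}

# List Auto Scaling instances with their zone and health, to see
# where the replacement instance was created.
list_asg_instances() {
  run aws autoscaling describe-auto-scaling-instances \
    --region "${1:-eu-west-1}" --output text \
    --query 'AutoScalingInstances[].[InstanceId,AvailabilityZone,HealthStatus]'
}

# Example:
#   DRY_RUN=1 mark_unhealthy i-0123456789abcdef0
#   list_asg_instances
```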

Note that if you want to connect to the queue manager using an MQ client, the supplied scripts set up a PASSWORD.SVRCONN channel, with a user of johndoe, and a password of passw0rd. It is, of course, recommended that you (at the very least) change this password, which can be found in the configure-mq-aws.sh script.
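As a hedged example of connecting a client, the MQ sample programs can be pointed at the load balancer through the MQSERVER environment variable. The helper name, the host name, and the queue name below are placeholders, and the sample path assumes a default MQ client installation under /opt/mqm.

```shell
# Sketch only: helper name, host and queue names are illustrative.

# Build an MQSERVER value of the form Channel/TCP/host(port).
mqserver_value() {
  local channel="$1" host="$2" port="${3:-1414}"
  printf '%s/TCP/%s(%s)\n' "$channel" "$host" "$port"
}

# Example, using the DNS name of the ELB as the host:
#   export MQSERVER="$(mqserver_value PASSWORD.SVRCONN my-elb.example.com)"
#   export MQSAMP_USER_ID=johndoe      # the sample then prompts for the password
#   /opt/mqm/samp/bin/amqsputc SOME.QUEUE mqdev
```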

Next steps and conclusion

This is just a PoC, but so far, EFS seems to provide the right characteristics for running MQ. There is clearly more to do here, including comprehensive testing of fail-over under load, and performance testing. With this particular setup, the fail-over between zones seems to take between one and three minutes, but that's nothing to do with EFS, and everything to do with the fact that we're creating a brand new EC2 instance when the old one fails - alternative solutions might use multi-instance queue managers, or an otherwise pre-created EC2 instance. There's also some scope for better tuning the health check grace periods, to ensure things return to "healthy" status as quickly as possible.

A fail-over time for a single-instance queue manager measuring in a small number of minutes may well be enough for many people. Either way, with EFS it's relatively easy to set up high availability across multiple availability zones without having to run your own replicated storage subsystem, which is definitely a positive thing.