rabbitmq/tgir

Testing RabbitMQ Resiliency


coro commented

How does RabbitMQ handle network latency? What about a clean network partition, a partial network partition, or a Byzantine fault?

We have a wide variety of Kubernetes tooling at our disposal that will let us make new discoveries about the behaviour of RabbitMQ.

Chaos Mesh - Chaos testing framework for Kubernetes clusters, hosted by the CNCF and very recently made GA. Allows many cluster disturbances to be run continuously or on a cron schedule, and against subsets of pods. The different chaos events are known as experiments. The available experiment types include:

  • Pods / Containers going down or failing
  • Network faults, such as partitions or packet loss/duplication/corruption etc.
  • CPU / memory stress in Pods
  • System Clock time offsets
  • Filesystem faults, such as permissions failures or other IO operation errors/delays
  • Kernel faults injected into pods
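
As a rough illustration of the network fault experiments, here is a minimal Python sketch that builds two Chaos Mesh NetworkChaos manifests, one for a clean partition between two nodes and one for added latency, and prints them as JSON. The namespace (`rabbitmq-system`), pod names (`rabbit-server-0`, `rabbit-server-1`), experiment names and durations are all hypothetical placeholders, and the field names follow the `chaos-mesh.org/v1alpha1` NetworkChaos schema as I understand it, so check them against the Chaos Mesh version you actually install.

```python
import json

API_VERSION = "chaos-mesh.org/v1alpha1"


def pod_selector(namespace, pod_name):
    """Select a single StatefulSet pod by the pod-name label Kubernetes sets."""
    return {
        "namespaces": [namespace],
        "labelSelectors": {"statefulset.kubernetes.io/pod-name": pod_name},
    }


def network_chaos(name, namespace, spec):
    """Wrap a spec in a NetworkChaos manifest (plain dict)."""
    return {
        "apiVersion": API_VERSION,
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


def partition(namespace, pod_a, pod_b, duration):
    """Cut traffic in both directions between pod_a and pod_b."""
    return network_chaos("rabbit-partition", namespace, {
        "action": "partition",
        "mode": "all",
        "selector": pod_selector(namespace, pod_a),
        "direction": "both",
        "target": {"mode": "all", "selector": pod_selector(namespace, pod_b)},
        "duration": duration,
    })


def latency(namespace, pod, delay, jitter, duration):
    """Add delay (with jitter) to traffic leaving the given pod."""
    return network_chaos("rabbit-latency", namespace, {
        "action": "delay",
        "mode": "all",
        "selector": pod_selector(namespace, pod),
        "delay": {"latency": delay, "jitter": jitter},
        "duration": duration,
    })


# Hypothetical namespace and pod names; adjust to your own cluster.
NAMESPACE = "rabbitmq-system"
experiments = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        partition(NAMESPACE, "rabbit-server-0", "rabbit-server-1", "60s"),
        latency(NAMESPACE, "rabbit-server-2", "200ms", "50ms", "60s"),
    ],
}

print(json.dumps(experiments, indent=2))
```

Since `kubectl apply -f -` accepts JSON on stdin, the output could be piped straight into a cluster that has Chaos Mesh installed. A partial partition could be sketched the same way by using `"direction": "to"` or by partitioning only some pairs of nodes.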

Obligatory CommitStrip:

[Image: CommitStrip comic strip]

Other tools

RabbitTestTool - Tool for orchestration and benchmarking of RabbitMQ clusters in EC2, GKE or EKS. Allows for 'playlists' to be created and run, where a playlist consists of systems, benchmarks and workloads. This allows easy A/B benchmarking of the same workloads against different systems, or different workloads on the same system.

Kubestone - Benchmarking tool for Kubernetes clusters. Implements a number of controllers for various benchmarking tools, such as system performance profiling, HTTP load benchmarks, etc. Can be extended with custom operators to support different benchmarks. For example, we could contribute to this project to provide a RabbitMQ benchmark if we so wished.

Sonobuoy - Security benchmarking & e2e testing of K8s workloads. Extensible with custom plugins. We could also contribute to this project with a RabbitMQ benchmarking plugin.