nebula-chaos

Chaos framework for the Storage Service

Plan Intro

There are some built-in plans in nebula-chaos. Each plan is a JSON file in the conf directory. A plan needs to specify some instances (usually including nebula graph/meta/storage) and some actions. The actions are a collection of actions of different types, which form a DAG. The dependencies between actions are specified in the depends field, and most actions need to specify the related nebula instance in the inst_index field. You can add customized plans based on these rules.
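A rough sketch of this structure is shown below. Apart from depends and inst_index, the key names, action types, and values are illustrative assumptions rather than the exact schema; check the files under conf for real examples.

{
    "instances": [
        {"host": "192.168.0.1", "install_path": "/usr/local/nebula"}
    ],
    "actions": [
        {"type": "StartAction", "inst_index": 0, "depends": []},
        {"type": "WriteCircleAction", "inst_index": 0, "depends": [0]}
    ]
}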

A utility to draw a flow chart of a plan is included; use it like this: python3 src/tools/FlowChart.py conf/scale_up_and_down.json.

Clean all WALs of the specified space, then start all services, write a circle, then check data integrity.

Start all services, disturb (randomly kill a storage service, clean its data path, then restart it) while writing a circle, then check data integrity.

Start all services, disturb (randomly kill and restart a storage service) while writing and reading.

Start all services, then kill all storage services and restart them while writing.

Start 3 storage services, add a 4th storage service using balance data while writing a circle, then check data integrity. Then stop the 1st storage service and remove it using balance data while writing a circle, then check data integrity. Likewise, add the 1st storage service back and remove the 4th storage service.
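For context, the balance operations that this plan drives correspond to Nebula console statements along the following lines (the host and port are placeholders, and the exact syntax may differ between Nebula versions):

BALANCE DATA
BALANCE DATA REMOVE 192.168.0.4:44500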

Similar to scale_up_and_down, but loops several times. One loop takes about 15 minutes, which may vary with the environment; you can adjust the number of loops yourself.

Start all services, disturb (randomly drop all packets of a storage service, then recover later) while writing a circle, then check data integrity. The network partition is based on iptables. Make sure the user has sudo authority and can execute iptables without a password.

PS: all storage services in random_network_partition and random_traffic_control must be deployed on different IPs. The reason is that we don't know the source port of a storage service, so we can only use the IP to identify the service.
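To illustrate the kind of IP-based rule this implies (the plan issues the actual commands itself; the address below is a placeholder), dropping and later restoring packets from one storage host looks roughly like this:

sudo iptables -A INPUT -s 192.168.0.3 -j DROP    # drop all packets from that storage host
sudo iptables -D INPUT -s 192.168.0.3 -j DROP    # delete the rule to recover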

Start all services, disturb (randomly delay all packets of a storage service, then recover later) while writing a circle, then check data integrity. The traffic control is based on tcconfig, which is a wrapper around the tc command. Install it first; since it uses the tc and ip commands, run the following commands so that it has the required capabilities without being the super user.

setcap cap_net_admin+ep /usr/sbin/tc
setcap cap_net_raw,cap_net_admin+ep /usr/sbin/ip
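For reference, tcconfig is installed via pip, and a delay comparable to what the plan injects can be applied and removed manually like this (the interface name, delay, and address are placeholders; depending on the tcconfig version the device may need to be passed as --device eth0):

pip3 install tcconfig
tcset eth0 --delay 100ms --network 192.168.0.3    # delay packets for the storage host
tcdel eth0 --all                                  # remove the rules to recover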

Start all services, disturb (cat /dev/zero until the disk is full) while writing a circle; the storage services that use the directory should crash. Then we clean the mock file, restart, and check data integrity at last.

Use a ramdisk or tmpfs with a limited size to test this plan; otherwise the whole disk will be occupied.
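For example, a small tmpfs can be mounted and used for the storage data path like this (the size, mount point, and file name are just examples):

sudo mount -t tmpfs -o size=2G tmpfs /mnt/chaos_data    # limited-size filesystem for the data path
cat /dev/zero > /mnt/chaos_data/mock_file               # fills it up, as the plan does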

Start all services, disturb (simulate slow disk IO) while writing a circle, then check data integrity. We use SystemTap to simulate slow disk IO. The major and minor fields are the MAJOR/MINOR device ids of the disk where the storage service's data path is mounted.
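The MAJOR/MINOR ids of the device backing the data path can be checked with lsblk, for example:

lsblk -o NAME,MAJ:MIN,MOUNTPOINT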

yum install systemtap

You may need to install kernel-devel and kernel-debuginfo as well (the version must be the same as the kernel).
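One way to install versions that match the running kernel (debuginfo-install comes from yum-utils and needs the debuginfo repositories enabled; adjust to your distribution):

yum install kernel-devel-$(uname -r)
debuginfo-install kernel-$(uname -r)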