Chaos framework for the Storage Service
There are some built-in plans in nebula-chaos. Each plan is a JSON file in the conf directory. A plan needs to specify some instances (usually including nebula graph/meta/storage services) and some actions. The actions are a collection of actions of different types, which together form a DAG. Dependencies between actions are specified in the depends
field. Most actions need to reference the related nebula instance through the inst_index
field. You can add custom plans based on these rules.
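For illustration, a plan skeleton might look like the following. Only the instances, actions, depends, and inst_index fields come from the description above; the action names and other fields are placeholders, so consult the real plans under conf for the exact schema:

```json
{
    "instances": [
        { "comment": "one entry per nebula graph/meta/storage service" }
    ],
    "actions": [
        { "name": "start-storage-0", "inst_index": 0, "depends": [] },
        { "name": "kill-storage-0",  "inst_index": 0, "depends": ["start-storage-0"] }
    ]
}
```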
A utility to draw a flow chart of a plan is included; use it like this: python3 src/tools/FlowChart.py conf/scale_up_and_down.json
Clean all WALs of the specified space, then start all services, write a circle, then check data integrity.
Start all services, disturb (randomly kill a storage service, clean its data path, restart it) while writing a circle, then check data integrity.
Start all services, disturb (randomly kill and restart a storage service) while writing and reading.
Start all services, kill all storage services and restart them while writing.
Start 3 storage services, add a 4th storage service using balance data
while writing a circle, then check data integrity. Then stop the 1st storage service and remove it using balance data
while writing a circle, then check data integrity. Likewise,
add the 1st storage service back and remove the 4th storage service.
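The balance operations above are ordinary console commands in Nebula Graph; a sketch of what the plan issues might look like this (the exact syntax depends on your Nebula version, and the host/port below are placeholders):

```ngql
-- rebalance partitions across the current storage hosts
BALANCE DATA;
-- move data off a storage host before removing it
BALANCE DATA REMOVE 192.168.0.10:44500;
```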
Similar to scale_up_and_down
, but loops several times. One loop takes about 15 minutes, which may vary with the environment; you can adjust the loop count yourself.
Start all services, disturb (randomly drop all packets of a storage service, recover later) while writing a circle, then check data integrity. The network partition is implemented with iptables. Make sure the user has sudo privileges and can execute iptables without a password.
PS: all storage services in random_network_partition and random_traffic_control must be deployed on different IPs. The reason is that we don't know the source port of a storage service, so we can only use the IP to identify the service.
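As a rough sketch of the kind of iptables rules involved (the plan manages these itself; the IP below is a placeholder):

```shell
# Drop every packet coming from the target storage service's IP.
sudo iptables -A INPUT -s 192.168.0.10 -j DROP
# Recover later by deleting the same rule.
sudo iptables -D INPUT -s 192.168.0.10 -j DROP
```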
Start all services, disturb (randomly delay all packets of a storage service, recover later) while writing a circle, then check data integrity. The traffic control is based on tcconfig, which is a tc
command wrapper. Install it first. Since it uses the tc
and ip
commands, run the following to grant them the capabilities needed to work without superuser privileges:
setcap cap_net_admin+ep /usr/sbin/tc
setcap cap_net_raw,cap_net_admin+ep /usr/sbin/ip
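For reference, installing tcconfig and delaying traffic by hand might look like this (the interface name, IP, and flags are illustrative and may vary across tcconfig versions):

```shell
pip3 install tcconfig
# Delay all packets destined to the storage service's IP by 100 ms.
tcset eth0 --delay 100ms --network 192.168.0.10
# Recover by removing all rules on the interface.
tcdel eth0 --all
```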
Start all services, disturb (cat /dev/zero until the disk is full) while writing a circle; the storage services using that directory should crash. Then we clean the mock file, restart, and check data integrity at last.
Use a ramdisk or tmpfs with limited size to test this plan, otherwise the whole disk will be filled.
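One way to set up such a limited-size mount (the size and path below are placeholders):

```shell
# Mount a small tmpfs and point the storage service's data path at it.
sudo mount -t tmpfs -o size=64m tmpfs /mnt/chaos_data
# Filling it (e.g. cat /dev/zero > /mnt/chaos_data/mock_file) then fails
# writes on this mount only, leaving the rest of the disk untouched.
```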
Start all services, disturb (simulate slow disk I/O) while writing a circle, then check data integrity. We use SystemTap to simulate slow disk I/O. The major
and minor
fields are the MAJOR/MINOR device IDs of the disk where the storage service's data path is mounted.
yum install systemtap
You may need to install kernel-devel
and kernel-debuginfo
as well (their versions must match the running kernel).