zeebe-io/zeebe-chaos

[EPIC] A reusable fault-injector and resolver

Zelldon opened this issue · 3 comments

Motivation

Currently we have several shell scripts to execute chaos experiments with chaostoolkit. The scripts work well, but they are rather hard to maintain, especially for people who might not be familiar enough with bash.

This is the reason why we already migrated some of them to a Kotlin-based chaos worker. But we haven't done this for the scripts which interact directly with the Kubernetes API. The problem here is that we would also need an executable CLI that the chaostoolkit experiments can reference, so that they can be run locally. Furthermore, interacting with Kubernetes from Go is, I would say, easier/better.

Solution

We create a new Go CLI with cobra. The CLI can be executed locally by chaostoolkit. Furthermore, we use the Zeebe Go worker API so that we can register on the testbench. We use the Go Kubernetes client to interact with the Kubernetes API, and use retry functionality, as we do in the shell scripts, to make the experiments less flaky.
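To illustrate the retry idea mentioned above, here is a minimal sketch using only the Go standard library. The function name `retry`, its parameters and the `errNotReady` error are hypothetical and chosen for illustration; this is not the actual zbchaos implementation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a made-up error used only by the demo in main.
var errNotReady = errors.New("pod not ready yet")

// retry calls op until it succeeds or the attempts are used up, sleeping
// for delay between two attempts. The last error is wrapped and returned.
func retry(attempts int, delay time.Duration, op func() error) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated flaky check: fails twice, then succeeds.
	err := retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errNotReady
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

A fixed delay keeps the sketch short; a real command would probably also want an overall timeout so that a broken cluster does not block an experiment forever.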

A further benefit would be to become a bit more familiar with Go and the Go client we provide.

Todos left:

  • Feature parity
  • Clean up:

Inventory

To see what is left and what is missing, here is a table of scripts/functionality and the related mapping in zbchaos:

| Script | Function | zbchaos counterpart |
|---|---|---|
| apply_net_admin.sh | Applies the NET_ADMIN capability to the Zeebe brokers | Part of zbchaos disconnect |
| await-message-correlation.sh | Deploys a model with a message catch event and awaits the completion of the instance | Can be done via zbchaos verify steady-state --awaitResult --processModelPath, since we can define the model and await the completion |
| await-processes-with-result.sh | Deploys a model, creates a PI and awaits the completion | zbchaos verify steady-state --awaitResult |
| connect-leaders.sh | Connects the brokers after disconnecting them | zbchaos connect brokers |
| connect-standalone-gateway.sh | Connects the standalone gateway again | zbchaos connect gateway |
| deploy-different-versions.sh | Deploys different versions of a certain model | zbchaos deploy process |
| deploy-model.sh | Deploys a process model | Part of zbchaos verify steady-state |
| corrupt* | Corrupts a follower's snapshot | Not part of zbchaos, since it is no longer in use |
| disconnect-leaders-one-way.sh | Disconnects leaders asymmetrically | zbchaos disconnect brokers --one-direction |
| disconnect-leaders.sh | Disconnects leaders bi-directionally | zbchaos disconnect brokers |
| disconnect-standalone-gateway.sh | Disconnects a standalone gateway from the brokers | zbchaos disconnect gateway --all |
| publish-message.sh | Publishes a message to partition one | zbchaos publish. This command also supports specifying different partitions and different message names. |
| shutdown-gracefully-partition.sh | Shuts down a broker with the given partition and role | zbchaos restart. This command allows specifying a broker via nodeId or via partitionId and role. |
| start-instance-on-partition-with-version.sh | Starts an instance with a specific version on a specific partition | zbchaos verify steady-state --version |
| start-many-instances.sh | Starts many instances in the Zeebe cluster | Not supported right now, and not used in our current experiments |
| stress-cpu.sh | Stresses the CPU with extra workload on a specific node (gateway or broker) | zbchaos stress gateway/broker --cpu |
| terminate-partition.sh | Terminates a broker with the given partition and role | zbchaos terminate. This command allows specifying a broker via nodeId or via partitionId and role. |
| terminate-workers.sh | Terminates workers in the Zeebe cluster | zbchaos terminate worker |
| util* | Contain util functions | Not necessary to port |
| verify-readiness.sh | Verifies the readiness, i.e. checks whether the gateway has an Available deployment and the brokers have ready pods | zbchaos verify readiness |
| verify-steady-state.sh | Verifies the steady state, i.e. deploys a process model and creates instances until a required partition is reached | zbchaos verify steady-state |
| zbctl-start-instances.sh | Used to create instances on a pod | Not necessary to port, part of start-many-instances.sh |
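Several rows above say that a broker can be specified either via nodeId or via partitionId and role. The following sketch shows one way such a lookup could work. The `Broker` type, the role constants and `resolveNodeID` are hypothetical simplifications for illustration; they are neither the topology types of the Zeebe Go client nor the zbchaos code.

```go
package main

import (
	"errors"
	"fmt"
)

// Role is the role a broker has for one partition.
type Role string

const (
	Leader   Role = "LEADER"
	Follower Role = "FOLLOWER"
)

// Broker is a hypothetical, simplified view of one topology entry.
type Broker struct {
	NodeID     int
	Partitions map[int]Role // partitionId -> role of this broker
}

var errNoBroker = errors.New("no matching broker")

// resolveNodeID returns the node id of the broker to act on. A nodeID >= 0
// is matched directly; otherwise the first broker that has the given role
// for the given partition is returned.
func resolveNodeID(topology []Broker, nodeID, partitionID int, role Role) (int, error) {
	for _, b := range topology {
		if nodeID >= 0 {
			if b.NodeID == nodeID {
				return nodeID, nil
			}
			continue
		}
		if r, ok := b.Partitions[partitionID]; ok && r == role {
			return b.NodeID, nil
		}
	}
	return -1, fmt.Errorf("nodeId=%d partitionId=%d role=%s: %w", nodeID, partitionID, role, errNoBroker)
}

// sampleTopology is made-up data: three brokers, two partitions.
func sampleTopology() []Broker {
	return []Broker{
		{NodeID: 0, Partitions: map[int]Role{1: Leader, 2: Follower}},
		{NodeID: 1, Partitions: map[int]Role{1: Follower, 2: Leader}},
		{NodeID: 2, Partitions: map[int]Role{1: Follower, 2: Follower}},
	}
}

func main() {
	nodeID, err := resolveNodeID(sampleTopology(), -1, 2, Leader)
	fmt.Println("leader of partition 2 runs on broker", nodeID, "err:", err)
}
```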

To understand which experiments are currently supported by zbchaos and which are missing, I list them in the following table. Be aware that I only mention the Production-S experiments, since these are the only experiments that we have automated.

| Experiment | Supported by zbchaos | Details |
|---|---|---|
| deployment-distribution | YES | - |
| follower-restart | YES | - |
| follower-terminate | YES | - |
| leader-restart | YES | - |
| leader-terminate | YES | - |
| msg-correlation | YES | - |
| multiple-leader-restart | YES | - |
| stress-cpu-on-broker | YES | - |
| worker-restart | YES | - |

What else is missing:

The current Kotlin worker also does some other things that we need to port before we can remove it completely.

  • Read all experiments and return them as variables. This is necessary for the chaos experiment automation, to know which experiments are executed and what they look like, i.e. which actions need to be executed etc.
  • Deploy workers as part of the chaos worker. It might make sense to add this as an extra subcommand to zbchaos, to deploy workers which can complete instances.
  • Adjust the experiments such that they use the zbchaos commands instead of referencing the scripts. This can be done incrementally #237
  • Use json logging in chaos worker zbchaos
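The first item above (read all experiments and return them as variables) could be sketched roughly as follows. The structs cover only a small part of the chaostoolkit experiment format (`title`, `method`, `provider`), the sample experiment and its arguments are made up, and the variable name `experiments` is an assumption, not necessarily what the automation expects.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Provider describes how an activity is executed (here: a process to run).
type Provider struct {
	Type      string   `json:"type"`
	Path      string   `json:"path"`
	Arguments []string `json:"arguments"`
}

// Activity is one action or probe of an experiment.
type Activity struct {
	Type     string   `json:"type"`
	Name     string   `json:"name"`
	Provider Provider `json:"provider"`
}

// Experiment holds the subset of fields this sketch cares about.
type Experiment struct {
	Title  string     `json:"title"`
	Method []Activity `json:"method"`
}

// sampleExperiment is a made-up, trimmed-down experiment definition.
const sampleExperiment = `{
  "title": "Leader terminate",
  "method": [
    {
      "type": "action",
      "name": "Terminate leader of partition 3",
      "provider": {
        "type": "process",
        "path": "terminate-partition.sh",
        "arguments": ["Leader", "3"]
      }
    }
  ]
}`

// experimentsAsVariables parses the raw experiment files and wraps them in
// a variables map, as a job worker could return it when completing a job.
func experimentsAsVariables(raw ...[]byte) (map[string]interface{}, error) {
	experiments := make([]Experiment, 0, len(raw))
	for i, r := range raw {
		var e Experiment
		if err := json.Unmarshal(r, &e); err != nil {
			return nil, fmt.Errorf("experiment %d: %w", i, err)
		}
		experiments = append(experiments, e)
	}
	return map[string]interface{}{"experiments": experiments}, nil
}

func main() {
	vars, err := experimentsAsVariables([]byte(sampleExperiment))
	if err != nil {
		panic(err)
	}
	out, _ := json.MarshalIndent(vars, "", "  ")
	fmt.Println(string(out))
}
```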

Done


Q2 2022 KR A reusable fault-injector and resolver is implemented and used in the Zeebe E2E and chaos tests

In order to make progress on this EPIC/Project I want to use the 2022 Summerhackdays.

My plan for the hackdays:

Preparation:

  • Set up Go project
  • Set up Go CI

Hackdays:

  • Start with the base (and make use of it in the CLI)

    • I can restart a leader of partition x
    • I can restart a follower of partition x
    • I can restart a gateway
    • All of the above, gracefully and non-gracefully (force)
    • I can disconnect nodes
      • Leader with other Leader
      • Leader with Follower
      • Follower with Leader
      • Gateway to Broker
      • Broker to Gateway
      • Bidirectional as well
    • I can restart a Broker and delete the PVC (data loss); I will not do it right now
    • I can send a message to a specific partition
    • I can start instances etc.
    • Replace all existing scripts (used scripts; verify that first):
      • apply_net_admin.sh
      • apply_sys_time.sh (not used)
      • await-message-correlation.sh
      • await-processes-with-result.sh
      • capture-and-compare-status.sh
      • complete-instance.sh
      • connect-leader-follower.sh
      • connect-leaders.sh
      • connect-nodes.sh
      • connect-standalone-gateway.sh
      • corruptFollowers.sh
      • corruptSnapshot.sh
      • deploy-different-versions.sh
      • deploy-model.sh
      • deploy-specific-model.sh
      • disconnect-elastic.sh
      • disconnect-leader-follower.sh
      • disconnect-leaders-one-way.sh
      • disconnect-leaders.sh
      • disconnect-standalone-gateway.sh
      • net_admin_patch.yaml
      • network-partition.sh
      • publish-message.sh
      • README.md
      • shutdown-gracefully-partition.sh
      • start-instance-on-partition-with-version.sh
      • start-many-instances.sh
      • stress-cpu.sh
      • sys_time_patch.yaml
      • terminate-partition.sh
      • terminate-workers.sh
      • turn-down-leader-regulary.sh
      • utils.sh
      • utilsTest.sh
      • verify-readiness.sh
      • verify-steady-state.sh
      • zbctl-start-instances.sh
  • Make use of the base and create Chaos Workers

    • Makes use of functionality above and provides that as workers (can be used by testbench later)
    • tbd
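The disconnect cases listed above (leader with leader, leader with follower, gateway to broker, one direction or bidirectional) all come down to a set of source/target pairs between which traffic has to be dropped. A small sketch of that planning step is shown below. The `Block` type and both functions are hypothetical, and how the traffic is actually dropped inside a pod is left out on purpose.

```go
package main

import "fmt"

// Block says: drop traffic that Source sends to Target.
type Block struct {
	Source string
	Target string
}

// planDisconnect returns the blocks needed to cut a off from b. With
// oneDirection set, only traffic from a to b is dropped.
func planDisconnect(a, b string, oneDirection bool) []Block {
	blocks := []Block{{Source: a, Target: b}}
	if !oneDirection {
		blocks = append(blocks, Block{Source: b, Target: a})
	}
	return blocks
}

// planDisconnectAll cuts one node (e.g. a gateway) off from every target.
func planDisconnectAll(source string, targets []string, oneDirection bool) []Block {
	var blocks []Block
	for _, t := range targets {
		blocks = append(blocks, planDisconnect(source, t, oneDirection)...)
	}
	return blocks
}

func main() {
	// Pod names are made up for the demo.
	brokers := []string{"broker-0", "broker-1", "broker-2"}
	for _, blk := range planDisconnectAll("gateway", brokers, false) {
		fmt.Printf("on %s: drop traffic to %s\n", blk.Source, blk.Target)
	}
}
```

Keeping the planning separate from the execution makes the direction logic testable without a cluster.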

Regarding testing the internal backend, which talks to the Kubernetes API and uses the k8s client: we can use a fake client, see https://medium.com/the-phi/mocking-the-kubernetes-client-in-go-for-unit-testing-ddae65c4302

https://pkg.go.dev/k8s.io/client-go/kubernetes/fake#Clientset.AppsV1

Example:

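```go
// Package clause omitted: this test belongs in the package that defines K8Client.
import (
	"context"
	"testing"

	"github.com/stretchr/testify/require"
	v1 "k8s.io/api/core/v1"
	v12 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/clientcmd/api"
)

// testClientConfig is a stub for clientcmd.ClientConfig that only serves
// the namespace; the other methods are not needed by the test.
type testClientConfig struct {
	namespace          string
	namespaceSpecified bool
	err                error
}

func (c *testClientConfig) Namespace() (string, bool, error) {
	return c.namespace, c.namespaceSpecified, c.err
}

func (c *testClientConfig) RawConfig() (api.Config, error) {
	panic("implement me")
}

func (c *testClientConfig) ClientConfig() (*rest.Config, error) {
	panic("implement me")
}

func (c *testClientConfig) ConfigAccess() clientcmd.ConfigAccess {
	panic("implement me")
}

func Test_GetBrokerPodNames(t *testing.T) {
	// given: a fake clientset that holds one pod in the "default" namespace
	k8Client := K8Client{Clientset: fake.NewSimpleClientset(), ClientConfig: &testClientConfig{namespace: "default"}}

	_, err := k8Client.Clientset.CoreV1().Pods(k8Client.GetCurrentNamespace()).Create(context.TODO(), &v1.Pod{
		Spec: v1.PodSpec{},
	}, v12.CreateOptions{})
	require.NoError(t, err)

	// when
	names, err := k8Client.GetBrokerPodNames()

	// then
	require.NoError(t, err)
	require.NotNil(t, names)
}
```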

https://www.youtube.com/watch?v=reDCJYbxtRg&ab_channel=CNCF%5BCloudNativeComputingFoundation%5D

Happy to announce that the EPIC is done and we have released v1.0.0 https://github.com/zeebe-io/zeebe-chaos/releases/tag/zbchaos-v1.0.0