[EPIC] A reusable fault-injector and resolver

Question

Zelldon opened this issue 2 years ago · 3 comments

Motivation

Currently we have several shell scripts to execute chaos experiments with chaostoolkit. The scripts work well, but they are rather hard to maintain, especially for people who are not familiar enough with bash.

This is the reason why we already migrated some of them to a Kotlin-based chaos worker. But we haven't done this for scripts that interact directly with the Kubernetes API. The problem here is that we would also need an executable CLI that the chaostoolkit experiments can reference, so that they can still be run locally. Furthermore, interacting with Kubernetes in Go is, I would say, easier/better.

Solution

We create a new Go CLI with cobra. The CLI can be executed by chaostoolkit locally. Furthermore, we use the Zeebe Go worker API so that we can register on the testbench. We use the Go Kubernetes client to interact with the Kubernetes API, and use retry functionality, as we do in the shell scripts, to make the experiments less flaky.

A side benefit is that we become a bit more familiar with Go and the Go client we provide.

Todos left:

Feature parity
Clean up:

Inventory

To see what is left and what is missing, here is a table of the scripts, their functionality, and the corresponding mapping in zbchaos:

| Script | Function | zbchaos counterpart |
| --- | --- | --- |
| `apply_net_admin.sh` | Applies the NET_ADMIN capability to the Zeebe brokers | Part of `zbchaos disconnect` |
| `await-message-correlation.sh` | Deploys a model with a message catch event and awaits the completion of the instance | Can be done via `zbchaos verify steady-state --awaitResult --processModelPath`, since we can define the model and await the completion |
| `await-processes-with-result.sh` | Deploys a model, creates a process instance and awaits the completion | `zbchaos verify steady-state --awaitResult` |
| `connect-leaders.sh` | Connects the brokers after disconnecting them | `zbchaos connect brokers` |
| `connect-standalone-gateway.sh` | Connects the standalone gateway again | `zbchaos connect gateway` |
| `deploy-different-versions.sh` | Deploys different versions of a certain model | `zbchaos deploy process` |
| `deploy-model.sh` | Deploys a process model | Part of `zbchaos verify steady-state` |
| `corrupt*` | Corrupts a follower's snapshot | Not part of zbchaos, since it is no longer in use |
| `disconnect-leaders-one-way.sh` | Disconnects leaders asymmetrically | `zbchaos disconnect brokers --one-direction` |
| `disconnect-leaders.sh` | Disconnects leaders bidirectionally | `zbchaos disconnect brokers` |
| `disconnect-standalone-gateway.sh` | Disconnects a standalone gateway from the brokers | `zbchaos disconnect gateway --all` |
| `publish-message.sh` | Publishes a message to partition one | `zbchaos publish`. This command also supports specifying different partitions and different message names. |
| `shutdown-gracefully-partition.sh` | Shuts down a broker with a given partition and role | `zbchaos restart`. This command allows specifying a broker via nodeId or via partitionId and role. |
| `start-instance-on-partition-with-version.sh` | Starts an instance with a specific version on a specific partition | `zbchaos verify steady-state --version` |
| `start-many-instances.sh` | Starts many instances in the Zeebe cluster | Not supported right now, and not used in our current experiments |
| `stress-cpu.sh` | Stresses the CPU with extra workload on a specific node (gateway or broker) | `zbchaos stress gateway/broker --cpu` |
| `terminate-partition.sh` | Terminates a broker with a given partition and role | `zbchaos terminate`. This command allows specifying a broker via nodeId or via partitionId and role. |
| `terminate-workers.sh` | Terminates workers in the Zeebe cluster | `zbchaos terminate worker` |
| `util*` | Contain utility functions | Do not need to be ported |
| `verify-readiness.sh` | Verifies the readiness, i.e. checks whether the gateway has an Available deployment and the brokers have ready pods | `zbchaos verify readiness` |
| `verify-steady-state.sh` | Verifies the steady state, i.e. deploys a process model and creates instances until a required partition is reached | `zbchaos verify steady-state` |
| `zbctl-start-instances.sh` | Used to create instances on a pod | Does not need to be ported, part of `start-many-instances.sh` |

To understand which experiments are currently supported by zbchaos and which are missing, I list them in the following table. Be aware that I only mention the Production-S experiments, since these are the only experiments that we have automated.

| Experiment | Supported by zbchaos | Details |
| --- | --- | --- |
| deployment-distribution | YES | - |
| follower-restart | YES | - |
| follower-terminate | YES | - |
| leader-restart | YES | - |
| leader-terminate | YES | - |
| msg-correlation | YES | - |
| multiple-leader-restart | YES | - |
| stress-cpu-on-broker | YES | - |
| worker-restart | YES | - |

What else is missing:

The current Kotlin worker also does some other things that we need to port before we can remove it completely.

Read all experiments and return them as variables. This is necessary for the chaos experiment automation, so that it knows which experiments are executed and what they look like, i.e. which actions need to be executed, etc.
Deploy workers as part of the chaos worker. It might make sense to add an extra subcommand to zbchaos that deploys workers which can complete instances.
Adjust the experiments such that they use the zbchaos commands instead of referencing the scripts. This can be done incrementally #237
Use JSON logging in chaos worker zbchaos
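Reading an experiment and extracting its actions, as described in the first item above, might be sketched as follows. This assumes the chaostoolkit JSON experiment layout with top-level `title` and `method` keys; the type and function names (`experiment`, `parseExperiment`, `actionNames`) and the sample experiment are hypothetical and not taken from the Kotlin worker or zbchaos.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// provider describes how an activity is executed, e.g. a process with a path.
// Other provider fields (such as arguments) are ignored in this sketch.
type provider struct {
	Type string `json:"type"`
	Path string `json:"path"`
}

// activity is one entry of the experiment method: an action or a probe.
type activity struct {
	Type     string   `json:"type"`
	Name     string   `json:"name"`
	Provider provider `json:"provider"`
}

// experiment holds only the fields this sketch needs.
type experiment struct {
	Title  string     `json:"title"`
	Method []activity `json:"method"`
}

// parseExperiment decodes a single experiment from its JSON representation.
func parseExperiment(data []byte) (experiment, error) {
	var e experiment
	err := json.Unmarshal(data, &e)
	return e, err
}

// actionNames returns the names of all method entries of type "action".
func actionNames(e experiment) []string {
	var names []string
	for _, a := range e.Method {
		if a.Type == "action" {
			names = append(names, a.Name)
		}
	}
	return names
}

func main() {
	// Hypothetical experiment, for illustration only.
	raw := []byte(`{
	  "title": "Zeebe leader terminate experiment",
	  "method": [
	    {"type": "action", "name": "Terminate leader",
	     "provider": {"type": "process", "path": "zbchaos", "arguments": ["terminate"]}}
	  ]
	}`)
	e, err := parseExperiment(raw)
	if err != nil {
		fmt.Println("failed to parse experiment:", err)
		return
	}
	fmt.Println(e.Title, actionNames(e))
}
```

The decoded values could then be returned as process variables so that the automation knows which actions to execute.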

Done

Q2 2022 KR A reusable fault-injector and resolver is implemented and used in the Zeebe E2E and chaos tests

Implement a new Go application which can be used locally as a CLI and as a worker library for testbench
- #140
- Split up https://github.com/zeebe-io/zeebe-chaos/pulls/122
- Add CI (GitHub Actions)
- TODO...

Answer 1 · 2022-08-08T06:48:44.000Z

In order to make progress on this EPIC/project, I want to use the 2022 Summer Hackdays.

My plan for the hackdays:

Preparation:

Setup go project
Setup go ci
- Follow #140

Hackdays:

Start with base (and make use in CLI)
- I can restart a leader of partition x
- I can restart a follower of partition x
- I can restart a gateway
- All of the above, gracefully and non-gracefully (force)
- I can disconnect nodes
  - Leader with other Leader
  - Leader with Follower
  - Follower with Leader
  - Gateway to Broker
  - Broker to Gateway
  - Bidirectional as well
- ~~I can restart a Broker and delete the PVC (data loss)~~ I will not do it right now
- I can send a message to a specific partition
- I can start instances etc.
- Replace all existing scripts (used scripts; verify that first):
  - apply_net_admin.sh
  - ~~apply_sys_time.sh~~ not used
  - await-message-correlation.sh
  - await-processes-with-result.sh
  - capture-and-compare-status.sh
  - complete-instance.sh
  - ~~connect-leader-follower.sh~~
  - connect-leaders.sh
  - ~~connect-nodes.sh~~
  - connect-standalone-gateway.sh
  - corruptFollowers.sh
  - corruptSnapshot.sh
  - deploy-different-versions.sh
  - deploy-model.sh
  - deploy-specific-model.sh
  - ~~disconnect-elastic.sh~~
  - ~~disconnect-leader-follower.sh~~
  - disconnect-leaders-one-way.sh
  - disconnect-leaders.sh
  - disconnect-standalone-gateway.sh
  - net_admin_patch.yaml
  - ~~network-partition.sh~~
  - publish-message.sh
  - README.md
  - shutdown-gracefully-partition.sh
  - start-instance-on-partition-with-version.sh
  - start-many-instances.sh
  - stress-cpu.sh
  - ~~sys_time_patch.yaml~~
  - terminate-partition.sh
  - terminate-workers.sh
  - ~~turn-down-leader-regulary.sh~~
  - utils.sh
  - utilsTest.sh
  - verify-readiness.sh
  - verify-steady-state.sh
  - zbctl-start-instances.sh
Make use of base and create Chaos Workers
- Makes use of the functionality above and provides it as workers (can be used by testbench later)
- tbd

Answer 2 · 2022-08-10T12:37:19.000Z

For testing the internal backend, which talks to the Kubernetes API via the k8s client, we can use a fake client: https://medium.com/the-phi/mocking-the-kubernetes-client-in-go-for-unit-testing-ddae65c4302

https://pkg.go.dev/k8s.io/client-go/kubernetes/fake#Clientset.AppsV1

Example:

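```go
package internal // assumed package name; use the package that defines K8Client

import (
	"context"
	"testing"

	"github.com/stretchr/testify/require"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/clientcmd/api"
)

// testClientConfig is a stub clientcmd.ClientConfig that only serves a fixed namespace.
type testClientConfig struct {
	namespace          string
	namespaceSpecified bool
	err                error
}

func (c *testClientConfig) Namespace() (string, bool, error) {
	return c.namespace, c.namespaceSpecified, c.err
}

// The remaining interface methods are not needed by the test yet.
func (c *testClientConfig) RawConfig() (api.Config, error) {
	panic("implement me")
}

func (c *testClientConfig) ClientConfig() (*rest.Config, error) {
	panic("implement me")
}

func (c *testClientConfig) ConfigAccess() clientcmd.ConfigAccess {
	panic("implement me")
}

func Test_GetBrokerPodNames(t *testing.T) {
	// given: K8Client is our own wrapper, backed here by the in-memory fake clientset
	k8Client := K8Client{Clientset: fake.NewSimpleClientset(), ClientConfig: &testClientConfig{namespace: "default"}}

	_, err := k8Client.Clientset.CoreV1().Pods(k8Client.GetCurrentNamespace()).Create(context.TODO(), &v1.Pod{
		Spec: v1.PodSpec{},
	}, metav1.CreateOptions{})
	require.NoError(t, err)

	// when
	names, err := k8Client.GetBrokerPodNames()

	// then
	require.NoError(t, err)
	require.NotNil(t, names)
}
```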

https://www.youtube.com/watch?v=reDCJYbxtRg&ab_channel=CNCF%5BCloudNativeComputingFoundation%5D

Answer 3 · 2022-12-21T09:32:03.000Z

Happy to announce that the EPIC is done and we have released v1.0.0 https://github.com/zeebe-io/zeebe-chaos/releases/tag/zbchaos-v1.0.0