alexei-led/pumba

Running stress-ng with K8s on an Alpine-based image causes errors

ManuelOverdijk opened this issue · 10 comments

Hi Alexei,

First off, thank you for developing Pumba, it has been amazing so far!

However, I'm now trying to stress test a pod deployed within Kubernetes and I'm running into an issue I can't explain. Running it locally with Docker works just fine.

Dockerfile of the container that needs to be stress tested:

FROM openjdk:8-jre-alpine
RUN apk add --no-cache bash
RUN apk add --update --no-cache iproute2
ADD ./target/ts-travel-service-1.0.jar /app/
CMD ["java", "-Xmx1G", "-jar", "/app/ts-travel-service-1.0.jar"]

Architecture: amd64

Pumba DaemonSet arguments:

            - --log-level
            - debug
            - --label
            - netem=true
            - --interval
            - 2m
            - stress
            - --duration
            - 1m

Creating the Pumba container results in the following error:

level=debug msg="stress testing container for duration" container=8858cd05c85c06d0949e962aa9c85651b777edac0fc05f1569c83ec54e8becc6 duration=1m0s pull image=true stress-ng image="alexeiled/stress-ng:latest-ubuntu" stressors="[--cpu 4 --timeout 60s]"
time="2020-03-02T09:52:52Z" level=info msg="stress testing container" dryrun=false duration=1m0s id=8858cd05c85c06d0949e962aa9c85651b777edac0fc05f1569c83ec54e8becc6 image="alexeiled/stress-ng:latest-ubuntu" name=/k8s_POD_ts-travel-service-dff6d79cf-mxcgs_trainticket_36622801-5c61-11ea-b876-020644012916_9 pull=true stressors="[--cpu 4 --timeout 60s]"
time="2020-03-02T09:52:52Z" level=debug msg="executing stress-ng command" image="alexeiled/stress-ng:latest-ubuntu" pull=true stressors="[--cpu 4 --timeout 60s]" target=8858cd05c85c06d0949e962aa9c85651b777edac0fc05f1569c83ec54e8becc6
time="2020-03-02T09:52:52Z" level=debug msg="pulling stress-ng image" image="alexeiled/stress-ng:latest-ubuntu"
time="2020-03-02T09:52:53Z" level=debug msg="&{Pulling from alexeiled/stress-ng   {0 0}}"
time="2020-03-02T09:52:53Z" level=debug msg="&{Digest: sha256:da1ead7d5fdba2c1c299fd0c1f9c98169e8690ce9cd2b0eb9be5d8d163c9a088   {0 0}}"
time="2020-03-02T09:52:53Z" level=debug msg="&{Status: Image is up to date for alexeiled/stress-ng:latest-ubuntu   {0 0}}"
time="2020-03-02T09:52:53Z" level=debug msg="creating stress-ng container" image="alexeiled/stress-ng:latest-ubuntu"
time="2020-03-02T09:52:53Z" level=debug msg="stress-ng container created, starting it" id=d9c2b2f4a88f6bd0fef32619fd700fc5e51bfe0f3ffdf4f5b9c7e09cfece12a0
time="2020-03-02T09:52:55Z" level=fatal msg="one or more stress test failed: stress-ng failed with error: stress-ng exited with error: \x02\x00\x00\x00\x00\x00\x00\x1ecgroup change of group failed\n"

So far I have been unable to find the cause of the "cgroup change of group failed" error. Is this error familiar to you, and if so, could you point me in the right direction?
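A side note on the fatal message above: the bytes `\x02\x00\x00\x00\x00\x00\x00\x1e` in front of the text look like the 8-byte frame header that Docker puts on multiplexed container output (stream type, three padding bytes, big-endian payload length). The sketch below decodes such a frame, assuming that framing applies here; `decode_frames` is a hypothetical helper name, not part of Pumba.

```python
import struct

# Stream identifiers used in Docker's multiplexed attach/logs stream.
STREAM_NAMES = {0: "stdin", 1: "stdout", 2: "stderr"}


def decode_frames(raw: bytes):
    """Split a multiplexed stream into (stream_name, text) pairs."""
    frames = []
    offset = 0
    while offset + 8 <= len(raw):
        # Header: 1 byte stream type, 3 padding bytes, 4-byte big-endian length.
        stream_type, length = struct.unpack_from(">BxxxI", raw, offset)
        offset += 8
        payload = raw[offset:offset + length]
        offset += length
        name = STREAM_NAMES.get(stream_type, "unknown")
        frames.append((name, payload.decode("utf-8", "replace")))
    return frames


# The raw bytes exactly as they appear in the fatal log line.
raw = b"\x02\x00\x00\x00\x00\x00\x00\x1ecgroup change of group failed\n"
print(decode_frames(raw))
```

Read this way, the header says "stderr, 30 bytes", so the actual stress-ng container output is just the line `cgroup change of group failed`.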

Thanks in advance, and once again, Pumba is really appreciated!

Hi Alexei, I got the same error when stressing the containers started by docker-compose as well.

Hi guys, I’m on vacation this week. Will check next week. Thank you for reporting the issue!

Hi Alexei, any news on this issue?

This week, hopefully.

Hi @ManuelOverdijk, can you provide more details about your K8s infra?
Node OS, version:
K8s version:
EKS/GKE/Custom/etc:

Sure!

Node details:

  Kernel Version:             4.14.154-128.181.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://18.9.9
  Kubelet Version:            v1.14.8-eks-b8860f
  Kube-Proxy Version:         v1.14.8-eks-b8860f

Let me know if you need more details!

@ManuelOverdijk, I found the issue.
K8s puts all Docker containers under a /kubepods/besteffort/pod${ID} or /kubepods/burstable/pod${ID} parent cgroup, whereas plain Docker uses the docker parent cgroup.
I've fixed the alexeiled/stress-ng:latest-ubuntu image (the dockhack script) to inject stress-ng under the proper cgroup.
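To see which parent cgroup a given container process lives under, one can read `/proc/<pid>/cgroup` on the node. The following is a minimal sketch that parses the cgroup v1 format of that file (`hierarchy-ID:controllers:path`); it is not the dockhack script itself. The function names and the sample contents are made up for illustration, and matching on a `kubepods` prefix is an assumption meant to also cover names such as `kubepods.slice`.

```python
from typing import Optional


def cgroup_path(proc_cgroup_text: str, controller: str = "cpu") -> Optional[str]:
    """Return the cgroup path for one controller from /proc/<pid>/cgroup text."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)  # hierarchy-ID : controller list : path
        if len(parts) != 3:
            continue
        _, controllers, path = parts
        if controller in controllers.split(","):
            return path
    return None


def parent_kind(path: str) -> str:
    """Classify the top-level parent cgroup of a container path."""
    first = path.strip("/").split("/", 1)[0]
    if first.startswith("kubepods"):  # assumption: also matches kubepods.slice
        return "kubernetes"
    if first == "docker":
        return "docker"
    return "other"


# Hypothetical file contents; real pod and container IDs are much longer.
K8S_SAMPLE = "4:cpu,cpuacct:/kubepods/besteffort/pod1234/abcd\n3:memory:/kubepods/besteffort/pod1234/abcd\n"
DOCKER_SAMPLE = "4:cpu,cpuacct:/docker/abcd\n"

print(parent_kind(cgroup_path(K8S_SAMPLE)))
print(parent_kind(cgroup_path(DOCKER_SAMPLE)))
```

On a real node the input would come from something like `open(f"/proc/{pid}/cgroup").read()`, where `pid` is the target container's main process.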

Another thing: Pumba should run with the SYS_ADMIN capability to be able to inject stress-ng into the target container, and with NET_ADMIN to inject network failures.
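One way to check whether a process actually received these capabilities is to read the `CapEff` bitmask from `/proc/<pid>/status`. The sketch below assumes the Linux capability numbers CAP_NET_ADMIN = 12 and CAP_SYS_ADMIN = 21; the helper names and the sample mask are illustrative only.

```python
# Capability bit positions from linux/capability.h.
CAP_NET_ADMIN = 12
CAP_SYS_ADMIN = 21


def effective_caps(status_text: str) -> int:
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def has_cap(mask: int, cap_bit: int) -> bool:
    """Report whether the given capability bit is set in the mask."""
    return bool(mask & (1 << cap_bit))


# Hypothetical sample: a mask with SYS_ADMIN set but NET_ADMIN missing.
SAMPLE_STATUS = "Name:\tpumba\nCapEff:\t00000000a82425fb\n"

mask = effective_caps(SAMPLE_STATUS)
print("SYS_ADMIN:", has_cap(mask, CAP_SYS_ADMIN))
print("NET_ADMIN:", has_cap(mask, CAP_NET_ADMIN))
```

Inside the Pumba container, the real input would be `open("/proc/self/status").read()`.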

Hey, @alexei-led! I'm afraid this issue is still occurring.

Test environment

minikube v1.9.0 on Microsoft Windows 10 Pro 10.0.18363 Build 18363
Kubernetes v1.18.0
Docker 19.03.8

k8s stress-ng resource

apiVersion: v1
kind: Pod
metadata:
  name: pumba-stress
  labels:
    com.gaiaadm.pumba: "true"
spec:
  containers:
    # stress all containers in the Pod named 'test-stress' every 3m for 1m
    - image: gaiaadm/pumba
      imagePullPolicy: IfNotPresent
      name: pumba-stress
      args:
        - --log-level
        - debug
        - --label
        - io.kubernetes.pod.name=test-stress
        - --interval
        - 3m
        - stress
        - --duration
        - 1m
        - --stressors
        - --cpu 2 --timeout 3m
      securityContext:
        capabilities:
          add: ["SYS_ADMIN"]
      resources:
        requests:
          cpu: 10m
          memory: 5M
        limits:
          cpu: 100m
          memory: 20M
      volumeMounts:
        - name: dockersocket
          mountPath: /var/run/docker.sock
  volumes:
    - hostPath:
        path: /var/run/docker.sock
      name: dockersocket

Output

time="2020-04-21T10:30:08Z" level=debug msg="stress testing all matching containers"
time="2020-04-21T10:30:08Z" level=debug msg="listing matching containers" duration=1m0s labels="[io.kubernetes.pod.name=test-stress]" limit=0 names="[]" pattern= random=false stressors="[--cpu 2 --timeout 3m]"
time="2020-04-21T10:30:08Z" level=debug msg="listing containers"
time="2020-04-21T10:30:09Z" level=debug msg="found container" id=f09136dc5356dcd9f95e5702012b171915dc1edaff118fe4ba867f8e6867de85 name=/k8s_test-stress-3_test-stress_rcnit-pumba-testing_6e611a14-c9c0-4927-aa26-f230eca7c02d_0
time="2020-04-21T10:30:09Z" level=debug msg="found container" id=4e3f3e9f80a9a1e0e5eea7fd4b73b3a807b32fe5f56d3add646b2e9229ba2830 name=/k8s_test-stress-2_test-stress_rcnit-pumba-testing_6e611a14-c9c0-4927-aa26-f230eca7c02d_0
time="2020-04-21T10:30:09Z" level=debug msg="found container" id=4556d5f6dfe6aa48f77afe8f3355c4c6bab8fee4a656e12e9133e7241fedc636 name=/k8s_test-stress-1_test-stress_rcnit-pumba-testing_6e611a14-c9c0-4927-aa26-f230eca7c02d_0
time="2020-04-21T10:30:09Z" level=debug msg="found container" id=58166882ce828d074bf2fd825a9dc38aee9817dac37d17e29ab9ee06fd5279e2 name=/k8s_POD_test-stress_rcnit-pumba-testing_6e611a14-c9c0-4927-aa26-f230eca7c02d_0
time="2020-04-21T10:30:09Z" level=debug msg="stress testing container for duration" container=58166882ce828d074bf2fd825a9dc38aee9817dac37d17e29ab9ee06fd5279e2 duration=1m0s pull image=true stress-ng image="alexeiled/stress-ng:latest-ubuntu" stressors="[--cpu 2 --timeout 3m]"
time="2020-04-21T10:30:09Z" level=info msg="stress testing container" dryrun=false duration=1m0s id=58166882ce828d074bf2fd825a9dc38aee9817dac37d17e29ab9ee06fd5279e2 image="alexeiled/stress-ng:latest-ubuntu" name=/k8s_POD_test-stress_rcnit-pumba-testing_6e611a14-c9c0-4927-aa26-f230eca7c02d_0 pull=true stressors="[--cpu 2 --timeout 3m]"
time="2020-04-21T10:30:09Z" level=debug msg="executing stress-ng command" image="alexeiled/stress-ng:latest-ubuntu" pull=true stressors="[--cpu 2 --timeout 3m]" target=58166882ce828d074bf2fd825a9dc38aee9817dac37d17e29ab9ee06fd5279e2
time="2020-04-21T10:30:09Z" level=debug msg="pulling stress-ng image" image="alexeiled/stress-ng:latest-ubuntu"
time="2020-04-21T10:30:09Z" level=debug msg="stress testing container for duration" container=4e3f3e9f80a9a1e0e5eea7fd4b73b3a807b32fe5f56d3add646b2e9229ba2830 duration=1m0s pull image=true stress-ng image="alexeiled/stress-ng:latest-ubuntu" stressors="[--cpu 2 --timeout 3m]"
time="2020-04-21T10:30:09Z" level=info msg="stress testing container" dryrun=false duration=1m0s id=4e3f3e9f80a9a1e0e5eea7fd4b73b3a807b32fe5f56d3add646b2e9229ba2830 image="alexeiled/stress-ng:latest-ubuntu" name=/k8s_test-stress-2_test-stress_rcnit-pumba-testing_6e611a14-c9c0-4927-aa26-f230eca7c02d_0 pull=true stressors="[--cpu 2 --timeout 3m]"
time="2020-04-21T10:30:09Z" level=debug msg="executing stress-ng command" image="alexeiled/stress-ng:latest-ubuntu" pull=true stressors="[--cpu 2 --timeout 3m]" target=4e3f3e9f80a9a1e0e5eea7fd4b73b3a807b32fe5f56d3add646b2e9229ba2830
time="2020-04-21T10:30:09Z" level=debug msg="pulling stress-ng image" image="alexeiled/stress-ng:latest-ubuntu"
time="2020-04-21T10:30:09Z" level=debug msg="stress testing container for duration" container=4556d5f6dfe6aa48f77afe8f3355c4c6bab8fee4a656e12e9133e7241fedc636 duration=1m0s pull image=true stress-ng image="alexeiled/stress-ng:latest-ubuntu" stressors="[--cpu 2 --timeout 3m]"
time="2020-04-21T10:30:09Z" level=info msg="stress testing container" dryrun=false duration=1m0s id=4556d5f6dfe6aa48f77afe8f3355c4c6bab8fee4a656e12e9133e7241fedc636 image="alexeiled/stress-ng:latest-ubuntu" name=/k8s_test-stress-1_test-stress_rcnit-pumba-testing_6e611a14-c9c0-4927-aa26-f230eca7c02d_0 pull=true stressors="[--cpu 2 --timeout 3m]"
time="2020-04-21T10:30:09Z" level=debug msg="executing stress-ng command" image="alexeiled/stress-ng:latest-ubuntu" pull=true stressors="[--cpu 2 --timeout 3m]" target=4556d5f6dfe6aa48f77afe8f3355c4c6bab8fee4a656e12e9133e7241fedc636
time="2020-04-21T10:30:09Z" level=debug msg="pulling stress-ng image" image="alexeiled/stress-ng:latest-ubuntu"
time="2020-04-21T10:30:09Z" level=debug msg="stress testing container for duration" container=f09136dc5356dcd9f95e5702012b171915dc1edaff118fe4ba867f8e6867de85 duration=1m0s pull image=true stress-ng image="alexeiled/stress-ng:latest-ubuntu" stressors="[--cpu 2 --timeout 3m]"
time="2020-04-21T10:30:09Z" level=info msg="stress testing container" dryrun=false duration=1m0s id=f09136dc5356dcd9f95e5702012b171915dc1edaff118fe4ba867f8e6867de85 image="alexeiled/stress-ng:latest-ubuntu" name=/k8s_test-stress-3_test-stress_rcnit-pumba-testing_6e611a14-c9c0-4927-aa26-f230eca7c02d_0 pull=true stressors="[--cpu 2 --timeout 3m]"
time="2020-04-21T10:30:09Z" level=debug msg="executing stress-ng command" image="alexeiled/stress-ng:latest-ubuntu" pull=true stressors="[--cpu 2 --timeout 3m]" target=f09136dc5356dcd9f95e5702012b171915dc1edaff118fe4ba867f8e6867de85
time="2020-04-21T10:30:09Z" level=debug msg="pulling stress-ng image" image="alexeiled/stress-ng:latest-ubuntu"
time="2020-04-21T10:30:10Z" level=debug msg="&{Pulling from alexeiled/stress-ng   {0 0}}"
time="2020-04-21T10:30:10Z" level=debug msg="&{Digest: sha256:6c2f6a6997aeb0dc7f7299e556f67cbb7ca1b40398ef32bbeea50b90fc020ae3   {0 0}}"
time="2020-04-21T10:30:10Z" level=debug msg="&{Status: Image is up to date for alexeiled/stress-ng:latest-ubuntu   {0 0}}"
time="2020-04-21T10:30:10Z" level=debug msg="creating stress-ng container" image="alexeiled/stress-ng:latest-ubuntu"
time="2020-04-21T10:30:10Z" level=debug msg="&{Pulling from alexeiled/stress-ng   {0 0}}"
time="2020-04-21T10:30:10Z" level=debug msg="&{Digest: sha256:6c2f6a6997aeb0dc7f7299e556f67cbb7ca1b40398ef32bbeea50b90fc020ae3   {0 0}}"
time="2020-04-21T10:30:10Z" level=debug msg="&{Status: Image is up to date for alexeiled/stress-ng:latest-ubuntu   {0 0}}"
time="2020-04-21T10:30:10Z" level=debug msg="&{Pulling from alexeiled/stress-ng   {0 0}}"
time="2020-04-21T10:30:10Z" level=debug msg="&{Digest: sha256:6c2f6a6997aeb0dc7f7299e556f67cbb7ca1b40398ef32bbeea50b90fc020ae3   {0 0}}"
time="2020-04-21T10:30:10Z" level=debug msg="&{Status: Image is up to date for alexeiled/stress-ng:latest-ubuntu   {0 0}}"
time="2020-04-21T10:30:10Z" level=debug msg="creating stress-ng container" image="alexeiled/stress-ng:latest-ubuntu"
time="2020-04-21T10:30:10Z" level=debug msg="creating stress-ng container" image="alexeiled/stress-ng:latest-ubuntu"
time="2020-04-21T10:30:10Z" level=debug msg="&{Pulling from alexeiled/stress-ng   {0 0}}"
time="2020-04-21T10:30:10Z" level=debug msg="&{Digest: sha256:6c2f6a6997aeb0dc7f7299e556f67cbb7ca1b40398ef32bbeea50b90fc020ae3   {0 0}}"
time="2020-04-21T10:30:10Z" level=debug msg="&{Status: Image is up to date for alexeiled/stress-ng:latest-ubuntu   {0 0}}"
time="2020-04-21T10:30:10Z" level=debug msg="creating stress-ng container" image="alexeiled/stress-ng:latest-ubuntu"
time="2020-04-21T10:30:10Z" level=debug msg="stress-ng container created, starting it" id=35a29600fb7f49f6232f20e431cd8df9c407e05532b64dccd3b91c9a8adbbf1b
time="2020-04-21T10:30:10Z" level=debug msg="stress-ng container created, starting it" id=3227f31b3ae48e7ce9e9cb298d5eb19e01f35f3399b76c3dc381e62333a11d0c
time="2020-04-21T10:30:10Z" level=debug msg="stress-ng container created, starting it" id=78dd7526c88384bf22a04a450d7667f3ae897a6e214c066c28ca8753ccf01327
time="2020-04-21T10:30:10Z" level=debug msg="stress-ng container created, starting it" id=00943610e75c74ebb0f6c1b297ca3807a5f0e0cd8cd0e06879a688896312bb21
time="2020-04-21T10:30:16Z" level=fatal msg="one or more stress test failed: stress-ng failed with error: stress-ng exited with error: \x02\x00\x00\x00\x00\x00\x00\x1ecgroup change of group failed\n"

Tried multiple variants of the YAML file (your DaemonSet included) and, after several clean minikube installs, the issue still persists.
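Both logs above list a container named `k8s_POD_...` among the stress targets. With the Docker runtime, Kubernetes names containers `k8s_<container>_<pod>_<namespace>_<pod-uid>_<attempt>`, and `POD` is the pod's sandbox (pause) container rather than an application container. The sketch below splits such names so the sandbox can be told apart from the workload; `parse_k8s_name` and `K8sName` are hypothetical helpers, and whether targeting the sandbox contributes to the failure is not established in this thread.

```python
from typing import NamedTuple, Optional


class K8sName(NamedTuple):
    container: str
    pod: str
    namespace: str
    uid: str
    attempt: int


def parse_k8s_name(docker_name: str) -> Optional[K8sName]:
    """Parse a Docker container name created by Kubernetes, or return None."""
    parts = docker_name.lstrip("/").split("_")
    if len(parts) != 6 or parts[0] != "k8s" or not parts[5].isdigit():
        return None
    _, container, pod, namespace, uid, attempt = parts
    return K8sName(container, pod, namespace, uid, int(attempt))


def is_sandbox(name: K8sName) -> bool:
    """The sandbox (pause) container uses the fixed container name POD."""
    return name.container == "POD"


# Two names copied from the debug output above.
names = [
    "/k8s_POD_test-stress_rcnit-pumba-testing_6e611a14-c9c0-4927-aa26-f230eca7c02d_0",
    "/k8s_test-stress-1_test-stress_rcnit-pumba-testing_6e611a14-c9c0-4927-aa26-f230eca7c02d_0",
]
for raw_name in names:
    parsed = parse_k8s_name(raw_name)
    print(parsed.container, parsed.pod, "sandbox" if is_sandbox(parsed) else "workload")
```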

minikube version: 1.15.1
OS: Ubuntu 20.04
K8s Rev: 1.19.4

Same issue.

@ghost I have never tested this on minikube for Windows. Unfortunately, I do not have a Windows environment. But if you can find the error and submit a PR, I can review it.