grafana/xk6-disruptor

Error invoking agent: unable to upgrade connection: container not found ("xk6-agent")

When injecting faults in the QuickPizza application, the following error is reported:

ERRO[0000] GoError: error injecting fault: error invoking agent: unable to upgrade connection: container not found ("xk6-agent")

The fault injection is done with this code:

import { ServiceDisruptor } from "k6/x/disruptor";

export function disrupt() {
  const disruptor = new ServiceDisruptor("pizza-info", "pizza-ns");
  const targets = disruptor.targets();
  if (targets.length == 0) {
    throw new Error("expected list to have one target");
  }

  disruptor.injectHTTPFaults({ averageDelay: "1000ms" }, "40s");
}

A quick inspection of the application reveals that the target pods have restarted:

kubectl get all -n pizza-ns
NAME                             READY   STATUS    RESTARTS      AGE
pod/pizza-info-9b788866d-9m28p   1/1     Running   1 (26m ago)   6h9m
pod/pizza-info-9b788866d-xz9pw   1/1     Running   1 (26m ago)   6h9m

NAME                 TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
service/pizza-info   LoadBalancer   10.98.215.80   127.0.0.1     3333:32247/TCP   6h9m

NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/pizza-info   2/2     2            2           6h9m

NAME                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/pizza-info-9b788866d   2         2         2       6h9m
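
Note that kubectl get all does not list ephemeral containers, so the state of the xk6-agent container has to be read from the pod status itself. A minimal check (a sketch, using one of the pod names from the listing above):

# ephemeral containers are only visible in the pod status
kubectl -n pizza-ns get pod pizza-info-9b788866d-9m28p \
  -o jsonpath='{.status.ephemeralContainerStatuses[*].state}'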
roobre commented

As a side note, I can reproduce this message if I manually stop the xk6-disruptor ephemeral container. In this case I stopped it manually by running

kubectl -n pizza-ns attach -ti pizza-info-9b788866d-6vv5c -c xk6-agent

and then running exit in the shell.

kubectl describe output for the pod in which I manually exited the container:

ERRO[0040] GoError: error injecting fault: error invoking agent: unable to upgrade connection: container not found ("xk6-agent")

load    at reflect.methodValueCall (native)
	at disrupt (file:///home/roobre/Devel/quickpizza/k6/12.basic-with-delay.js:35:29(26))  executor=shared-iterations scenario=disrupt source=stacktrace
11:54:57 ~ $> kc describe -n pizza-ns pod pizza-info-9b788866d-6vv5c
Name:             pizza-info-9b788866d-6vv5c
Namespace:        pizza-ns
Priority:         0
Service Account:  pizza-info
Node:             minikube/192.168.49.2
Start Time:       Thu, 22 Jun 2023 11:48:16 +0200
Labels:           app=pizza-info
                  pod-template-hash=9b788866d
Annotations:      <none>
Status:           Running
IP:               10.244.0.14
IPs:
  IP:           10.244.0.14
Controlled By:  ReplicaSet/pizza-info-9b788866d
Containers:
  pizza-info:
    Container ID:   containerd://3da6065074e9b589a5eca6af23fc6625c8f5ddf93b87e34cf63054053cfd7245
    Image:          ghcr.io/grafana/quickpizza-local:latest
    Image ID:       ghcr.io/grafana/quickpizza-local@sha256:eb4d7eb611b52f25f3647888bce564185d16ce8a1cf2fdf8ec17ec1919cd3362
    Port:           3333/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Thu, 22 Jun 2023 11:48:17 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:             (v1:spec.nodeName)
      POD_NAME:             pizza-info-9b788866d-6vv5c (v1:metadata.name)
      POD_NAMESPACE:        pizza-ns (v1:metadata.namespace)
      POD_IP:                (v1:status.podIP)
      POD_SERVICE_ACCOUNT:   (v1:spec.serviceAccountName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xv8gf (ro)
Ephemeral Containers:
  xk6-agent:
    Container ID:   containerd://a5e96ac0221eceb4d6b869d0e3933a6ca62ca38703d436fc4c86ca624456cfe6
    Image:          ghcr.io/grafana/xk6-disruptor-agent:v0.3.1
    Image ID:       ghcr.io/grafana/xk6-disruptor-agent@sha256:67094b9d6fe81a57c6ab99d672acf9a31ff56033389dcb02b3679cd4f6815fc5
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 22 Jun 2023 11:49:10 +0200
      Finished:     Thu, 22 Jun 2023 11:50:36 +0200
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:         <none>
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-xv8gf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  6m50s  default-scheduler  Successfully assigned pizza-ns/pizza-info-9b788866d-6vv5c to minikube
  Normal  Pulling    6m50s  kubelet            Pulling image "ghcr.io/grafana/quickpizza-local:latest"
  Normal  Pulled     6m49s  kubelet            Successfully pulled image "ghcr.io/grafana/quickpizza-local:latest" in 611.748637ms (611.810066ms including waiting)
  Normal  Created    6m49s  kubelet            Created container pizza-info
  Normal  Started    6m49s  kubelet            Started container pizza-info
  Normal  Pulled     5m56s  kubelet            Container image "ghcr.io/grafana/xk6-disruptor-agent:v0.3.1" already present on machine
  Normal  Created    5m56s  kubelet            Created container xk6-agent
  Normal  Started    5m56s  kubelet            Started container xk6-agent

pablochacin commented

> I can reproduce this message if I manually stop the xk6-disruptor ephemeral container. In this case I stopped it manually by running kubectl -n pizza-ns attach -ti pizza-info-9b788866d-6vv5c -c xk6-agent and then exiting the shell.

This is weird because I've done it many times when debugging or testing the agent manually. This is even documented.

I wonder if this is a regression.

roobre commented

@pablochacin I think that by doing attach instead of exec, I can kill the container on exit. I believe this happens because when I run attach I hook into the shell that is running as the container entrypoint, instead of creating a new one like exec does.
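
For comparison, a sketch (not from the original thread) of the exec variant on the same pod, assuming a shell is available in the agent image; exec starts a new process inside the container instead of attaching to its entrypoint, so exiting it does not stop the container:

# starts a new shell process; exiting it leaves the container entrypoint running
kubectl -n pizza-ns exec -ti pizza-info-9b788866d-6vv5c -c xk6-agent -- sh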

pablochacin commented

> I think that by doing attach instead of exec

I hadn't noticed the attach.

I have found this same behavior in an unrelated test using the Inject HTTP faults example. The test used a kind cluster deployed following the instructions in the xk6-disruptor-demo repository.

In this case, the ephemeral container seems to have terminated "spontaneously":

  ephemeralContainerStatuses:
  - containerID: containerd://82214f374b6e0924de886d14fd83f3be83f746b0c63255e9ceb47bb568a7467e
    image: ghcr.io/grafana/xk6-disruptor-agent:latest
    imageID: ghcr.io/grafana/xk6-disruptor-agent@sha256:146dcfb9636e0be662a501171ef2bdcaa8e1b67ec1a05a99a0818495303637b3
    lastState: {}
    name: xk6-agent
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: containerd://82214f374b6e0924de886d14fd83f3be83f746b0c63255e9ceb47bb568a7467e
        exitCode: 255
        finishedAt: "2023-06-21T16:21:17Z"
        reason: Unknown
        startedAt: "2023-06-21T12:13:24Z"

Something I noticed is that the kind cluster was apparently restarted at some point. This can be seen from the difference between the created and up times reported by docker for the container in which the test cluster runs:

docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED        STATUS        PORTS                                              NAMES
ef502cd89b15   kindest/node:v1.24.0   "/usr/local/bin/entr…"   23 hours ago   Up 18 hours   0.0.0.0:38080->80/tcp, 127.0.0.1:35971->6443/tcp   demo-control-plane

Coincidentally, the up time matches the uptime of my workstation:

 uptime
 12:44:50 up 18:24,  1 user,  load average: 2,57, 2,94, 2,79
roobre commented

I can reproduce this as well. Stopping the cluster (e.g. minikube stop) and starting it again will cause the ephemeral container to exit, which in turn causes the error described in the issue.

I guess this, in fact, happens to all containers when the CRI stops. The difference might be that the kubelet does restart the main container, but not the ephemeral one.
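
A minimal reproduction sketch of this scenario, assuming a minikube cluster with the quickpizza deployment shown above:

# restarting the cluster stops every container running on the node
minikube stop
minikube start

# the kubelet brings the regular pizza-info containers back up, but the
# xk6-agent ephemeral containers remain terminated, so the next fault
# injection fails with the error reported in this issue
kubectl -n pizza-ns get pod -l app=pizza-info \
  -o jsonpath='{.items[*].status.ephemeralContainerStatuses[*].state}'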