Can not run e2e test on a modified cluster
STX5 opened this issue · 3 comments
What steps did you take and what happened:
I tried to run Sonobuoy in my cluster with this command:
sonobuoy run --wait --kubeconfig kubeconfig.txt
and the result was:
15:21:13 e2e global failed failed
15:21:13 systemd-logs cn-zhangjiakou.1.1.1.1 complete passed
15:21:13 systemd-logs cn-zhangjiakou.1.1.1.2 complete passed
15:21:13 systemd-logs cn-zhangjiakou.1.1.1.3 complete passed
15:21:13 systemd-logs cn-zhangjiakou.1.1.1.4 complete passed
I retrieved the results and found that the e2e test failed while pulling the image "registry.k8s.io/conformance:v1.18.8".
rpc error: code = unknown desc = error response from daemon: get https://asia-east1-docker.pkg.dev/v2/k8s-artifacts-prod/images/conformance/manifests/v1.18.8: dial tcp 74.125.23.82:443: i/o timeout
I assume this means my cluster has trouble accessing the public registry, so I pushed this image to my private registry
$ docker push alien-registry.alibaba-inc.com/sonobuoy:v0.56.16
$ docker push alien-registry.alibaba-inc.com/conformance:v1.18.8
Then I used sonobuoy gen to manually create a YAML file (I have to do so because my private registry requires an ImagePullSecret):
$ kubectl create secret docker-registry regcred --docker-server=alien-registry.alibaba-inc.com --docker-username={XXX} --docker-password={XXX} --namespace=sonobuoy --kubeconfig kubeconfig.txt
secret/regcred created
$ echo '{"ImagePullSecrets":"regcred"}' > secretconfig.json
$ sonobuoy gen --config secretconfig.json --kubeconfig kubeconfig.txt \
--kube-conformance-image alien-registry.alibaba-inc.com/conformance:v1.18.8 \
--sonobuoy-image alien-registry.alibaba-inc.com/sonobuoy:v0.56.16 > test.yaml
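Before starting a run, it can help to confirm that the generated manifest really points at the private registry. The following is only a minimal sketch: `manifest_images` is a hypothetical helper name, and it assumes every image reference sits on its own `image: <ref>` line, which is typical for Kubernetes YAML but not guaranteed.

```shell
# Print the distinct container images referenced by a manifest file.
# Assumption: each reference appears as "image: <ref>" or "- image: <ref>" on one line.
manifest_images() {
  awk '$1 == "image:" { print $2 } $1 == "-" && $2 == "image:" { print $3 }' "$1" \
    | tr -d "\"'" \
    | sort -u
}
```

Running `manifest_images test.yaml` should then list only `alien-registry.alibaba-inc.com/...` entries. Images that the e2e tests launch on their own are not part of this manifest, so they would not show up here.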
Then I started the test:
$ sonobuoy run --wait --kubeconfig kubeconfig.txt -f test.yaml
...
...
PLUGIN STATUS RESULT COUNT PROGRESS
e2e failed failed 1
systemd-logs complete passed 4
I inspected the cluster while Sonobuoy was running:
$ kubectl -n sonobuoy get pods --kubeconfig kubeconfig.txt
NAME READY STATUS RESTARTS AGE
sonobuoy 1/1 Running 0 6m49s
sonobuoy-e2e-job-0888f06407d84816 1/2 Error 0 6m48s
sonobuoy-systemd-logs-daemon-set-22c5c7e57f2a40a5-4spsp 2/2 Running 0 6m48s
sonobuoy-systemd-logs-daemon-set-22c5c7e57f2a40a5-6jmcj 2/2 Running 0 6m48s
sonobuoy-systemd-logs-daemon-set-22c5c7e57f2a40a5-n4ptk 2/2 Running 0 6m48s
sonobuoy-systemd-logs-daemon-set-22c5c7e57f2a40a5-nbmvv 2/2 Running 0 6m48s
A closer look at the failed pod:
$ kubectl -n sonobuoy describe pod sonobuoy-e2e-job-0888f06407d84816 --kubeconfig kubeconfig.txt
...
Containers:
e2e:
Container ID: docker://244457a29060c4ca0582dfde150c5cc416607003ab9854eb40dc127cdffed11f
Image: alien-registry.alibaba-inc.com/conformance:v1.18.8
Image ID: docker-pullable://alien-registry.alibaba-inc.com/conformance@sha256:15cd6405e4baaeb3d13b25a296115783bba53dacad2c9e06ef530c24c5860ff4
Port: <none>
Host Port: <none>
Command:
/run_e2e.sh
State: Terminated
Reason: Error
Exit Code: 126
Started: Fri, 09 Jun 2023 15:13:35 +0800
Finished: Fri, 09 Jun 2023 15:13:35 +0800
Ready: False
Restart Count: 0
...
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m1s default-scheduler Successfully assigned sonobuoy/sonobuoy-e2e-job-0888f06407d84816 to cn-zhangjiakou.172.16.0.178
Normal Pulled 8m kubelet Container image "alien-registry.alibaba-inc.com/conformance:v1.18.8" already present on machine
Normal Created 8m kubelet Created container e2e
Normal Started 8m kubelet Started container e2e
Normal Pulled 8m kubelet Container image "alien-registry.alibaba-inc.com/sonobuoy:v0.56.16" already present on machine
Normal Created 8m kubelet Created container sonobuoy-worker
Normal Started 8m kubelet Started container sonobuoy-worker
Normal Killing 25s kubelet Stopping container sonobuoy-worker
The image was pulled successfully, but the container does not seem to have started correctly.
I retrieved the results, but no useful error message was logged:
name: e2e
status: failed
meta:
type: summary
items:
- name: 'Container e2e is in a terminated state (exit code 126) due to reason: Error: '
status: failed
meta:
file: errors/global/error.json
details:
error: 'Container e2e is in a terminated state (exit code 126) due to reason:Error: '
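The exit code itself carries a hint. The container command is a shell script (/run_e2e.sh), and by shell convention 126 means a command was found but could not be executed. A rough sketch of that convention follows; `explain_exit_code` is a hypothetical helper, and the mapping covers only the common shell cases, not anything specific to Sonobuoy.

```shell
# Translate a process exit code into the conventional shell meaning.
explain_exit_code() {
  case "$1" in
    0)   echo "success" ;;
    126) echo "command found but could not be executed (permissions or wrong binary format)" ;;
    127) echo "command not found" ;;
    *)
      # Codes above 128 conventionally mean "killed by signal (code - 128)".
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $(( $1 - 128 ))"
      else
        echo "application-specific failure"
      fi
      ;;
  esac
}
```

The underlying message would normally be in the container log, for example via `kubectl -n sonobuoy logs sonobuoy-e2e-job-0888f06407d84816 -c e2e --kubeconfig kubeconfig.txt`, provided the pod still exists.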
What did you expect to happen:
A more specific error message in the results file.
Environment:
- Sonobuoy version:
Sonobuoy Version: v0.56.16
MinimumKubeVersion: 1.17.0
MaximumKubeVersion: 1.99.99
GitSHA:
GoVersion: go1.20.1
Platform: darwin/arm64
API Version: v1.18.8-aliyun.1
- Kubernetes version: (use kubectl version):
Server Version: v1.18.8-aliyun.1
- Kubernetes installer & version: aliyun
- Cloud provider or hardware configuration: aliyun
- OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
- Sonobuoy tarball (which contains * below)
Hi @STX5,
Were you able/are you able to reproduce and capture the logs from the errored container? My guess at this stage is that what you're doing is correct although there is no construct for defining a private registry for the containers used by the e2e test. There is an open pull request #1893 that addresses this.
If the Sonobuoy e2e pod uses a private registry, it should be able to pull its images as long as the instructions at https://sonobuoy.io/docs/v0.19.0/pullsecrets/ are followed, but the secret must be in the same namespace as the e2e pod.
PR #1893 addresses the case where the containers spawned by the e2e process use a private registry: they are created in different namespaces and have no access to the secret.
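A quick way to check the namespace requirement is to ask for the secret in the namespace where the pod runs. This is a minimal sketch, not part of Sonobuoy: `has_pull_secret` is a hypothetical helper, and it assumes the secret was created with `kubectl create secret docker-registry`, which produces the type `kubernetes.io/dockerconfigjson`.

```shell
# Succeeds if <secret> exists in <namespace> and is a docker-registry pull secret.
# Usage: has_pull_secret <namespace> <secret> [extra kubectl args...]
has_pull_secret() {
  ns="$1"; name="$2"; shift 2
  # A missing secret makes kubectl fail, which we report as "not present".
  secret_type=$(kubectl -n "$ns" get secret "$name" -o jsonpath='{.type}' "$@" 2>/dev/null) || return 1
  [ "$secret_type" = "kubernetes.io/dockerconfigjson" ]
}
```

For example, `has_pull_secret sonobuoy regcred --kubeconfig kubeconfig.txt && echo present`. The same check against a namespace created by the e2e tests would be expected to fail, since secrets are not shared across namespaces.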
@franknstyle, thank you for your reply. I captured the logs from the errored container:
line 43: /gorunner: cannot execute binary file: Exec format error
The container image was pulled with the command docker pull registry.k8s.io/conformance:v1.18.8 and then pushed to the private registry. (I'm using an M1 MacBook; the cluster is x86.)
It was weird, because I had used docker inspect to make sure I pulled the right image ("Architecture": "amd64", "Os": "linux"). And the other image, sonobuoy, worked just fine.
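For reference, one way to reduce the chance of mirroring the wrong variant from an ARM machine is to pin the platform at pull time. This is only a sketch under assumptions: `mirror_amd64` is a hypothetical helper, and the metadata check it performs is the same `docker inspect` information quoted above, so it may not catch every mismatch.

```shell
# Pull the linux/amd64 variant of an image, retag it, and push it to another registry.
# Usage: mirror_amd64 <source-image> <target-image>
mirror_amd64() {
  src="$1"; dst="$2"
  docker pull --platform linux/amd64 "$src" || return 1
  # Check what the local image metadata reports before pushing it anywhere.
  platform=$(docker image inspect --format '{{.Os}}/{{.Architecture}}' "$src") || return 1
  if [ "$platform" != "linux/amd64" ]; then
    echo "unexpected platform: $platform" >&2
    return 1
  fi
  docker tag "$src" "$dst" && docker push "$dst"
}
```

Usage would look like `mirror_amd64 registry.k8s.io/conformance:v1.18.8 alien-registry.alibaba-inc.com/conformance:v1.18.8`.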
Anyway, I switched back to my old Intel MacBook and pushed another image to try again.
$ sonobuoy run --wait --kubeconfig kubeconfig.txt -f test.yaml
$ kubectl -n sonobuoy get pods --kubeconfig kubeconfig.txt
NAME READY STATUS RESTARTS AGE
sonobuoy 1/1 Running 0 24s
sonobuoy-e2e-job-fc17e4758ab743a8 2/2 Running 0 23s
sonobuoy-systemd-logs-daemon-set-ffb20b1d02124de5-2spvc 2/2 Running 0 23s
sonobuoy-systemd-logs-daemon-set-ffb20b1d02124de5-8j5rx 2/2 Running 0 23s
sonobuoy-systemd-logs-daemon-set-ffb20b1d02124de5-kq2ns 2/2 Running 0 23s
sonobuoy-systemd-logs-daemon-set-ffb20b1d02124de5-vrq79 2/2 Running 0
It looked good, but the test was not progressing. I looked at the log in the e2e-job pod:
...
Jun 12 08:02:03.017: INFO: ================================
Jun 12 08:02:33.016: INFO: Condition Ready of node med-delay-monitor-node is false, but Node is tainted by NodeController with [{node.kubernetes.io/unreachable NoSchedule 2021-01-28 07:52:32 +0000 UTC} {node.kubernetes.io/unreachable NoExecute 2023-02-08 03:03:16 +0000 UTC}]. Failure
Jun 12 08:02:33.016: INFO: Unschedulable nodes:
Jun 12 08:02:33.016: INFO: -> med-delay-monitor-node Ready=false Network=false Taints=[{node.kubernetes.io/unreachable NoSchedule 2021-01-28 07:52:32 +0000 UTC} {node.kubernetes.io/unreachable NoExecute 2023-02-08 03:03:16 +0000 UTC}] NonblockingTaints:node-role.kubernetes.io/master
Jun 12 08:02:33.016: INFO: ================================
Jun 12 08:03:03.017: INFO: Condition Ready of node med-delay-monitor-node is false, but Node is tainted by NodeController with [{node.kubernetes.io/unreachable NoSchedule 2021-01-28 07:52:32 +0000 UTC} {node.kubernetes.io/unreachable NoExecute 2023-02-08 03:03:16 +0000 UTC}]. Failure
Jun 12 08:03:03.017: INFO: Unschedulable nodes:
Jun 12 08:03:03.017: INFO: -> med-delay-monitor-node Ready=false Network=false Taints=[{node.kubernetes.io/unreachable NoSchedule 2021-01-28 07:52:32 +0000 UTC} {node.kubernetes.io/unreachable NoExecute 2023-02-08 03:03:16 +0000 UTC}] NonblockingTaints:node-role.kubernetes.io/master
Jun 12 08:03:03.017: INFO: ================================
Jun 12 08:03:33.017: INFO: Condition Ready of node med-delay-monitor-node is false, but Node is tainted by NodeController with [{node.kubernetes.io/unreachable NoSchedule 2021-01-28 07:52:32 +0000 UTC} {node.kubernetes.io/unreachable NoExecute 2023-02-08 03:03:16 +0000 UTC}]. Failure
Jun 12 08:03:33.017: INFO: Unschedulable nodes:
Jun 12 08:03:33.017: INFO: -> med-delay-monitor-node Ready=false Network=false Taints=[{node.kubernetes.io/unreachable NoSchedule 2021-01-28 07:52:32 +0000 UTC} {node.kubernetes.io/unreachable NoExecute 2023-02-08 03:03:16 +0000 UTC}] NonblockingTaints:node-role.kubernetes.io/master
Jun 12 08:03:33.017: INFO: ================================
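The repeated lines above all concern a single node, med-delay-monitor-node, which reports Ready=false and carries unreachable taints. A small sketch for listing nodes in that state is shown here; `not_ready_nodes` is a hypothetical helper, and it assumes the default column layout of `kubectl get nodes`, where STATUS is the second column.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the names
# of nodes whose STATUS column contains "NotReady".
not_ready_nodes() {
  awk '$2 ~ /NotReady/ { print $1 }'
}
```

It could be used as `kubectl get nodes --no-headers --kubeconfig kubeconfig.txt | not_ready_nodes`.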
I think @Divya063's explanation may have a point: the e2e test creates other namespaces, so the ImagePullSecret does not work there.
BTW, sonobuoy images pull does not seem to work properly with some k8s versions:
$ sonobuoy images pull --kubernetes-version v1.18.8
INFO[0000] e2e image to be used: registry.k8s.io/conformance:v1.18.8
ERRO[0000] failed to gather test images from e2e image: exit status 1
ERRO[0000] unable to collect images of plugins