mayastor addon does not start correctly on arm64 platform
Summary
When enabling the mayastor addon on arm64 servers, all pods reach the Running state except the rest
pod, which stays in CrashLoopBackOff. If I watch microk8s kubectl get pod -A,
I can see an OOMKilled status from time to time; a quick way to confirm this is sketched below.
Pools are in an error state.
This is an arm64-only issue; the same setup works fine on amd64.
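To catch the transient OOMKilled reason, something like the following works (the pod name below is from this cluster; substitute your own):

# Watch pod states across all namespaces; the rest pod cycles through
# Running, OOMKilled, and CrashLoopBackOff.
$ microk8s kubectl get pod -A -w

# Print the last termination reason of the failing container.
$ microk8s kubectl -n mayastor get pod rest-77d69fb479-7w5m5 \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'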
What Should Happen Instead?
All pods should be in the Running state and the pools should be created.
Reproduction Steps
Yes, I can, but only on arm64.
Build a six-node MicroK8s cluster, configure the prerequisites as described in the docs, and enable the mayastor addon; a sketch of these steps is below.
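For reference, a sketch of the per-node steps (hugepages and the nvme_tcp kernel module are the documented prerequisites; exact commands may differ between MicroK8s versions):

# Mayastor requires 2MiB hugepages on every node
$ sudo sysctl vm.nr_hugepages=1024
$ echo 'vm.nr_hugepages = 1024' | sudo tee -a /etc/sysctl.conf

# The nvme_tcp kernel module must be loadable
$ sudo apt install linux-modules-extra-$(uname -r)
$ sudo modprobe nvme_tcp

# Enable the addon (run once; it deploys cluster-wide)
$ microk8s enable mayastor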
Introspection Report
inspection-report-20230413_124122.tar.gz
Can you suggest a fix?
Are you interested in contributing with a fix?
Adding the pod describe output and logs from the pod:
Name: rest-77d69fb479-7w5m5
Namespace: mayastor
Priority: 0
Service Account: default
Node: sqa-lab2-node-3-arm/10.246.200.234
Start Time: Thu, 13 Apr 2023 10:57:55 +0000
Labels: app=rest
pod-template-hash=77d69fb479
Annotations: cni.projectcalico.org/containerID: 965ba5a977e627a35c9053628b85b635373de515f16584d0ce0f4b9cd6f36c32
cni.projectcalico.org/podIP: 10.1.50.1/32
cni.projectcalico.org/podIPs: 10.1.50.1/32
Status: Running
IP: 10.1.50.1
IPs:
IP: 10.1.50.1
Controlled By: ReplicaSet/rest-77d69fb479
Init Containers:
grpc-probe:
Container ID: containerd://e17ceb1c993de4f3267bd7b35afbe4909aa8918200bf6fdcb8743e6ba35d3a11
Image: busybox:1.28.4
Image ID: docker.io/library/busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47
Port: <none>
Host Port: <none>
Command:
sh
-c
trap "exit 1" TERM; until nc -vz core 50051; do echo "Waiting for grpc services..."; sleep 1; done;
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 13 Apr 2023 10:58:04 +0000
Finished: Thu, 13 Apr 2023 10:58:27 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9wgzg (ro)
etcd-probe:
Container ID: containerd://a4f35afbf0f75946c536fbaaed026a3cb6a3cff8c97dd25027ec8a4542e3f4f6
Image: busybox:1.28.4
Image ID: docker.io/library/busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47
Port: <none>
Host Port: <none>
Command:
sh
-c
trap "exit 1" TERM; until nc -vz etcd-client 2379; do echo "Waiting for etcd..."; sleep 1; done;
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 13 Apr 2023 10:58:28 +0000
Finished: Thu, 13 Apr 2023 10:58:28 +0000
Containers:
rest:
Ready: False
Restart Count: 25
Limits:
cpu: 100m
memory: 128Mi
Requests:
cpu: 50m
memory: 32Mi
Environment:
RUST_LOG: info
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9wgzg (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-9wgzg:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 3m55s (x470 over 108m) kubelet Back-off restarting failed container rest in pod rest-77d69fb479-7w5m5_mayastor(0c188899-9dc4-44c1-9d38-28f26d6878c4)
$ microk8s kubectl logs -f rest-77d69fb479-7w5m5 -n mayastor --all-containers
nc: bad address 'core'
rest version 1.0.0, git hash v1.0.0-156-g2d749934787a
Apr 13 12:44:35.661 INFO actix_server::builder: Starting 160 workers
Apr 13 12:44:35.661 INFO actix_server::server: Actix runtime found; starting in Actix runtime
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
Waiting for grpc services...
nc: bad address 'core'
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
core (10.1.121.193:50051) open
etcd-client (10.152.183.88:2379) open
Hi @marosg42, can you manually increase the resource limits to see what cap the rest pod needs? microk8s kubectl edit
can help you with this.
I have not been able to reproduce this issue. Can you keep doubling the limits until the pod no longer crashes? If it's only 1-2 doublings, then I guess we can bump the defaults; otherwise we would have to investigate further for a possible memory leak and/or a worse bug.
Thanks!
@neoaggelos I could not reproduce it on the new 1.27, but 1.26 still gives the error.
I will leave it up to you whether to close the issue; I am OK with using 1.27.
I tried to change the limits out of curiosity, but I got:
# pods "rest-77d69fb479-x8zqj" was not valid:
# * spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
Yes, you should not change the pods themselves, but rather the pod template in the deployment spec, with:
microk8s kubectl edit deploy/rest -n mayastor
Apologies for not making it clear in the original message!
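For example, a non-interactive equivalent for doubling the memory limit from the default 128Mi (a sketch; -c rest assumes the container in the deployment is named rest):

$ microk8s kubectl -n mayastor set resources deploy/rest -c rest \
    --limits=memory=256Mi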
On 1.26, the pod was still crashing with the memory limit set to 512Mi; after switching to 1Gi it kept running.
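For the record, the setting that ended up working, using the non-interactive form from the previous comment (again assuming the container is named rest):

$ microk8s kubectl -n mayastor set resources deploy/rest -c rest \
    --limits=memory=1Gi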