canonical/microk8s-core-addons

mayastor addon does not start correctly on arm64 platform


Summary

When enabling the mayastor addon on arm64 servers, all pods reach the Running state except the rest pod, which stays in CrashLoopBackOff. If I monitor get pod -A, I can see an OOMKilled status from time to time.
Pools are in an error state.
This is an arm64-only issue; the same setup works fine on amd64.
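One way to watch for this (the mayastor namespace is the one shown in the pod details below; any equivalent watch works):

microk8s kubectl get pods -n mayastor -w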

What Should Happen Instead?

All pods should be in Running state and pools should be created.

Reproduction Steps

Yes, I can, but only on arm64.
Build a six-node MicroK8s cluster, configure the prerequisites as described in the docs, and enable the mayastor addon.
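For completeness, the enable sequence looks roughly like this; the prerequisite commands and the enable invocation are paraphrased from the MicroK8s mayastor documentation and may differ between releases, so treat them as a sketch rather than the exact commands used here:

# on every node: mayastor needs huge pages and the extra kernel modules (nvme_tcp)
sudo sysctl vm.nr_hugepages=1024
echo 'vm.nr_hugepages = 1024' | sudo tee -a /etc/sysctl.conf
sudo apt install linux-modules-extra-$(uname -r)

# once the cluster is formed, enable the addon
microk8s enable mayastor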

Introspection Report

inspection-report-20230413_124122.tar.gz

Can you suggest a fix?

Are you interested in contributing with a fix?

Adding the pod describe output and logs from the pod:

Name:             rest-77d69fb479-7w5m5                                                                                                                                                                                                
Namespace:        mayastor                                                                                                                                                                                                             
Priority:         0                                                                                                
Service Account:  default
Node:             sqa-lab2-node-3-arm/10.246.200.234     
Start Time:       Thu, 13 Apr 2023 10:57:55 +0000                                                                  
Labels:           app=rest                                                                                         
                  pod-template-hash=77d69fb479                                                                                                                                                                                         
Annotations:      cni.projectcalico.org/containerID: 965ba5a977e627a35c9053628b85b635373de515f16584d0ce0f4b9cd6f36c32                                                                                                                  
                  cni.projectcalico.org/podIP: 10.1.50.1/32                                                        
                  cni.projectcalico.org/podIPs: 10.1.50.1/32                                                       
Status:           Running           
IP:               10.1.50.1                                                                                                                                                                                                            
IPs:                                                                                                               
  IP:           10.1.50.1 
Controlled By:  ReplicaSet/rest-77d69fb479       
Init Containers:                                  
  grpc-probe:                                       
    Container ID:  containerd://e17ceb1c993de4f3267bd7b35afbe4909aa8918200bf6fdcb8743e6ba35d3a11                   
    Image:         busybox:1.28.4                     
    Image ID:      docker.io/library/busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47                                                                                                                   
    Port:          <none>                                                                                          
    Host Port:     <none>                              
    Command:                                                                                                                                                                                                                           
      sh                                                                                                           
      -c                                                 
      trap "exit 1" TERM; until nc -vz core 50051; do echo "Waiting for grpc services..."; sleep 1; done;          
    State:          Terminated                    
      Reason:       Completed                  
      Exit Code:    0                                    
      Started:      Thu, 13 Apr 2023 10:58:04 +0000    
      Finished:     Thu, 13 Apr 2023 10:58:27 +0000      
    Ready:          True                                                                                           
    Restart Count:  0                                                                                              
    Environment:    <none>                                                                                         
    Mounts:       
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9wgzg (ro)                                
  etcd-probe:                                            
    Container ID:  containerd://a4f35afbf0f75946c536fbaaed026a3cb6a3cff8c97dd25027ec8a4542e3f4f6
    Image:         busybox:1.28.4                                                                                  
    Image ID:      docker.io/library/busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47                                                                                                                   
    Port:          <none>                                
    Host Port:     <none>  
    Command:                                                                                                       
      sh                                                                                                           
      -c                                                 
      trap "exit 1" TERM; until nc -vz etcd-client 2379; do echo "Waiting for etcd..."; sleep 1; done;             
    State:          Terminated                           
      Reason:       Completed                                                                                      
      Exit Code:    0                             
      Started:      Thu, 13 Apr 2023 10:58:28 +0000                                                                
      Finished:     Thu, 13 Apr 2023 10:58:28 +0000   
    Ready:          False
    Restart Count:  25
    Limits:
      cpu:     100m
      memory:  128Mi
    Requests:
      cpu:     50m
      memory:  32Mi
    Environment:
      RUST_LOG:  info
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9wgzg (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-9wgzg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Warning  BackOff  3m55s (x470 over 108m)  kubelet  Back-off restarting failed container rest in pod rest-77d69fb479-7w5m5_mayastor(0c188899-9dc4-44c1-9d38-28f26d6878c4)
$ microk8s kubectl logs -f rest-77d69fb479-7w5m5 -n mayastor --all-containers
nc: bad address 'core'
rest version 1.0.0, git hash v1.0.0-156-g2d749934787a
Apr 13 12:44:35.661  INFO actix_server::builder: Starting 160 workers    
Apr 13 12:44:35.661  INFO actix_server::server: Actix runtime found; starting in Actix runtime    
Waiting for grpc services...
nc: bad address 'core'
Waiting for grpc services...
nc: bad address 'core'
[the two lines above repeat many times while the init container waits for the core service to become resolvable]
core (10.1.121.193:50051) open
etcd-client (10.152.183.88:2379) open

Hi @marosg42, can you manually increase the resource limits to find the value the rest pod actually needs? microk8s kubectl edit can help you with this.

I have not been able to reproduce this issue. Can you keep doubling the limits until the pod no longer crashes? If it only needs 1-2x more, then I guess we can just bump the limits; otherwise we would have to investigate further for a possible memory leak and/or a worse bug.

Thanks!
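
A quick way to confirm what the limits are currently set to after each change, assuming the deployment and the container are both named rest as the describe output above suggests:

microk8s kubectl -n mayastor get deploy rest \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="rest")].resources}'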

@neoaggelos I could not reproduce it on the new 1.27, but 1.26 still gives the error.
I will leave it up to you whether to close the issue; I am OK with using 1.27.

I tried to change the limits out of curiosity, but I got:

# pods "rest-77d69fb479-x8zqj" was not valid:
# * spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)

Yes, you should not be changing the pods themselves, but rather the spec template for the deployment, with:

microk8s kubectl edit deploy/rest -n mayastor

Apologies for not making it clear in the original message!
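
If editing interactively is inconvenient, the same change can be applied with a one-off patch; the container name rest and the example 1Gi value are assumptions based on the describe output and the comment below, so adjust as needed:

# strategic merge patch: changes only the memory limit of the rest container
microk8s kubectl -n mayastor patch deploy rest \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"rest","resources":{"limits":{"memory":"1Gi"}}}]}}}}'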

On 1.26, with the memory limit set to 512Mi it was still crashing; after switching to 1Gi it kept running.