replicatedhq/kots

`kubectl kots install <app--channel>` fails with default app and unstable channel

camilamacedo86 opened this issue · 4 comments

Description

Unable to run kubectl kots install <app--channel> following the getting started guide: https://docs.replicated.com/vendor/tutorial-installing-with-existing-cluster

Note that the app/release was created with the default files only and I am unable to start the admin console:

Enter the namespace to deploy to: dev4devs-unstable
  • Deploying Admin Console
    • Creating namespace ✓  
    • Waiting for datastore to be ready ✓  
Enter a new password to be used for the Admin Console: ••••••••••
    • Waiting for Admin Console to be ready ⠦Error: Failed to deploy: failed to deploy admin console: failed to wait for web: timeout waiting for deployment to become ready. Use the --wait-duration flag to increase timeout.

Environment

  • go 1.19.2
  • kind v0.16.0
  • k8s 1.25

Hello @camilamacedo86! This seems to indicate that some of the admin console resources failed to become ready during the installation. Performing a kubectl describe on any of those resources may point to a root cause (feel free to share those outputs here).

Also, we have a Support Bundle utility that will collect this kind of information from the cluster automatically. Here's a link to our docs for steps on how to generate one: https://docs.replicated.com/enterprise/troubleshooting-an-app#generating-a-bundle-using-the-cli. If you do generate one, you can provide the generated .tar.gz archive here as well.
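
For reference, the checks described above look roughly like the following (the namespace is whatever was chosen during the install, and the last command assumes the support-bundle kubectl plugin from the linked docs is installed):

$ kubectl get pods -n <namespace>
$ kubectl describe deployment kotsadm -n <namespace>
$ kubectl describe pod <kotsadm-pod-name> -n <namespace>
# generate a support bundle for the Admin Console (spec URI per the docs above)
$ kubectl support-bundle https://kots.io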

Hi @cbodonnell,

Thank you so much for your time and attention 🙏.

Note that I am creating a new app and a new release without any changes, just to test it out.

The error is Warning FailedScheduling 3m9s default-scheduler 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling. and I have only 1 node, see:

$ kubectl get node
NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   11m   v1.25.2

My guess is that it is failing because of:

kots/pkg/kurl/join_cert.go

Lines 230 to 241 in b43fa9a

Tolerations: []corev1.Toleration{
    {
        Key:      "node-role.kubernetes.io/master",
        Operator: corev1.TolerationOpExists,
        Effect:   corev1.TaintEffectNoSchedule,
    },
    {
        Key:      "node-role.kubernetes.io/control-plane",
        Operator: corev1.TolerationOpExists,
        Effect:   corev1.TaintEffectNoSchedule,
    },
},

Note that I have the label node-role.kubernetes.io/control-plane on the node, and if I spin up kind with 3 nodes I face: Warning FailedScheduling 71s default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
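
For reference, the node labels and taints can be double-checked with something like the commands below (just a sketch; any equivalent output format works):

$ kubectl get nodes --show-labels
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'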

TL;DR: Following are the outputs for your reference:

$ kubectl describe node/kind-control-plane
Name:               kind-control-plane
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=arm64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=arm64
                    kubernetes.io/hostname=kind-control-plane
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 26 Oct 2022 18:38:40 +0100
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  kind-control-plane
  AcquireTime:     <unset>
  RenewTime:       Wed, 26 Oct 2022 18:54:30 +0100
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 26 Oct 2022 18:50:26 +0100   Wed, 26 Oct 2022 18:38:37 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 26 Oct 2022 18:50:26 +0100   Wed, 26 Oct 2022 18:38:37 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 26 Oct 2022 18:50:26 +0100   Wed, 26 Oct 2022 18:38:37 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 26 Oct 2022 18:50:26 +0100   Wed, 26 Oct 2022 18:39:03 +0100   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.18.0.2
  Hostname:    kind-control-plane
Capacity:
  cpu:                5
  ephemeral-storage:  263899620Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  hugepages-32Mi:     0
  hugepages-64Ki:     0
  memory:             12249368Ki
  pods:               110
Allocatable:
  cpu:                5
  ephemeral-storage:  263899620Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  hugepages-32Mi:     0
  hugepages-64Ki:     0
  memory:             12249368Ki
  pods:               110
System Info:
  Machine ID:                 578f01a3d92a440cb41ce30c8209912c
  System UUID:                578f01a3d92a440cb41ce30c8209912c
  Boot ID:                    42edd258-91f0-4309-adac-b07363ee04fc
  Kernel Version:             5.15.49-linuxkit
  OS Image:                   Ubuntu 22.04.1 LTS
  Operating System:           linux
  Architecture:               arm64
  Container Runtime Version:  containerd://1.6.8
  Kubelet Version:            v1.25.2
  Kube-Proxy Version:         v1.25.2
PodCIDR:                      10.244.0.0/24
PodCIDRs:                     10.244.0.0/24
ProviderID:                   kind://docker/kind/kind-control-plane
Non-terminated Pods:          (11 in total)
  Namespace                   Name                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                          ------------  ----------  ---------------  -------------  ---
  kube-system                 coredns-565d847f94-dknfs                      100m (2%)     0 (0%)      70Mi (0%)        170Mi (1%)     15m
  kube-system                 coredns-565d847f94-k87xs                      100m (2%)     0 (0%)      70Mi (0%)        170Mi (1%)     15m
  kube-system                 etcd-kind-control-plane                       100m (2%)     0 (0%)      100Mi (0%)       0 (0%)         15m
  kube-system                 kindnet-d5r2m                                 100m (2%)     100m (2%)   50Mi (0%)        50Mi (0%)      15m
  kube-system                 kube-apiserver-kind-control-plane             250m (5%)     0 (0%)      0 (0%)           0 (0%)         15m
  kube-system                 kube-controller-manager-kind-control-plane    200m (4%)     0 (0%)      0 (0%)           0 (0%)         15m
  kube-system                 kube-proxy-jxv2h                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         15m
  kube-system                 kube-scheduler-kind-control-plane             100m (2%)     0 (0%)      0 (0%)           0 (0%)         15m
  local-path-storage          local-path-provisioner-684f458cdd-9qv5n       0 (0%)        0 (0%)      0 (0%)           0 (0%)         15m
  test-python                 kotsadm-minio-0                               50m (1%)      100m (2%)   100Mi (0%)       512Mi (4%)     15m
  test-python                 kotsadm-postgres-0                            100m (2%)     200m (4%)   100Mi (0%)       200Mi (1%)     15m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1100m (22%)  400m (8%)
  memory             490Mi (4%)   1102Mi (9%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  hugepages-32Mi     0 (0%)       0 (0%)
  hugepages-64Ki     0 (0%)       0 (0%)
Events:
  Type    Reason                   Age                From             Message
  ----    ------                   ----               ----             -------
  Normal  Starting                 15m                kube-proxy       
  Normal  NodeHasSufficientMemory  16m (x5 over 16m)  kubelet          Node kind-control-plane status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    16m (x5 over 16m)  kubelet          Node kind-control-plane status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     16m (x4 over 16m)  kubelet          Node kind-control-plane status is now: NodeHasSufficientPID
  Normal  Starting                 15m                kubelet          Starting kubelet.
  Normal  NodeAllocatableEnforced  15m                kubelet          Updated Node Allocatable limit across pods
  Normal  NodeHasSufficientMemory  15m                kubelet          Node kind-control-plane status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    15m                kubelet          Node kind-control-plane status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     15m                kubelet          Node kind-control-plane status is now: NodeHasSufficientPID
  Normal  RegisteredNode           15m                node-controller  Node kind-control-plane event: Registered Node kind-control-plane in Controller
  Normal  NodeReady                15m                kubelet          Node kind-control-plane status is now: NodeReady

Following are all the outputs:

$ kubectl get all -n test-python
NAME                          READY   STATUS    RESTARTS   AGE
pod/kotsadm-d74669fc9-rpj7r   0/1     Pending   0          2m20s
pod/kotsadm-minio-0           1/1     Running   0          2m46s
pod/kotsadm-postgres-0        1/1     Running   0          2m46s

NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/kotsadm            ClusterIP   10.96.227.234   <none>        3000/TCP   2m20s
service/kotsadm-minio      ClusterIP   10.96.63.70     <none>        9000/TCP   2m45s
service/kotsadm-postgres   ClusterIP   10.96.180.57    <none>        5432/TCP   2m45s

NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kotsadm   0/1     1            0           2m20s

NAME                                DESIRED   CURRENT   READY   AGE
replicaset.apps/kotsadm-d74669fc9   1         1         0       2m20s

NAME                                READY   AGE
statefulset.apps/kotsadm-minio      1/1     2m46s
statefulset.apps/kotsadm-postgres   1/1     2m46s
$ kubectl describe pod/kotsadm-d74669fc9-rpj7r -n test-python
Name:             kotsadm-d74669fc9-rpj7r
Namespace:        test-python
Priority:         0
Service Account:  kotsadm
Node:             <none>
Labels:           app=kotsadm
                  kots.io/backup=velero
                  kots.io/kotsadm=true
                  pod-template-hash=d74669fc9
Annotations:      backup.velero.io/backup-volumes: backup
                  pre.hook.backup.velero.io/command: ["/backup.sh"]
                  pre.hook.backup.velero.io/timeout: 10m
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/kotsadm-d74669fc9
Init Containers:
  schemahero-plan:
    Image:      kotsadm/kotsadm-migrations:v1.88.0
    Port:       <none>
    Host Port:  <none>
    Args:
      plan
    Limits:
      cpu:     100m
      memory:  100Mi
    Requests:
      cpu:     50m
      memory:  50Mi
    Environment:
      SCHEMAHERO_DRIVER:     postgres
      SCHEMAHERO_SPEC_FILE:  /tables
      SCHEMAHERO_OUT:        /migrations/plan.yaml
      SCHEMAHERO_URI:        <set to the key 'uri' in secret 'kotsadm-postgres'>  Optional: false
    Mounts:
      /migrations from migrations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vvrx9 (ro)
  schemahero-apply:
    Image:      kotsadm/kotsadm-migrations:v1.88.0
    Port:       <none>
    Host Port:  <none>
    Args:
      apply
    Limits:
      cpu:     100m
      memory:  100Mi
    Requests:
      cpu:     50m
      memory:  50Mi
    Environment:
      SCHEMAHERO_DRIVER:  postgres
      SCHEMAHERO_DDL:     /migrations/plan.yaml
      SCHEMAHERO_URI:     <set to the key 'uri' in secret 'kotsadm-postgres'>  Optional: false
    Mounts:
      /migrations from migrations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vvrx9 (ro)
  restore-db:
    Image:      kotsadm/kotsadm:v1.88.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /restore-db.sh
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     100m
      memory:  100Mi
    Environment:
      POSTGRES_PASSWORD:  <set to the key 'password' in secret 'kotsadm-postgres'>  Optional: false
    Mounts:
      /backup from backup (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vvrx9 (ro)
  restore-s3:
    Image:      kotsadm/kotsadm:v1.88.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /restore-s3.sh
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     100m
      memory:  100Mi
    Environment:
      S3_ENDPOINT:           http://kotsadm-minio:9000
      S3_BUCKET_NAME:        kotsadm
      S3_ACCESS_KEY_ID:      <set to the key 'accesskey' in secret 'kotsadm-minio'>  Optional: false
      S3_SECRET_ACCESS_KEY:  <set to the key 'secretkey' in secret 'kotsadm-minio'>  Optional: false
      S3_BUCKET_ENDPOINT:    true
    Mounts:
      /backup from backup (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vvrx9 (ro)
Containers:
  kotsadm:
    Image:      kotsadm/kotsadm:v1.88.0
    Port:       3000/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:      100m
      memory:   100Mi
    Readiness:  http-get http://:3000/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      SHARED_PASSWORD_BCRYPT:     <set to the key 'passwordBcrypt' in secret 'kotsadm-password'>              Optional: false
      AUTO_CREATE_CLUSTER_TOKEN:  <set to the key 'kotsadm-cluster-token' in secret 'kotsadm-cluster-token'>  Optional: false
      SESSION_KEY:                <set to the key 'key' in secret 'kotsadm-session'>                          Optional: false
      POSTGRES_PASSWORD:          <set to the key 'password' in secret 'kotsadm-postgres'>                    Optional: false
      POSTGRES_URI:               <set to the key 'uri' in secret 'kotsadm-postgres'>                         Optional: false
      POD_NAMESPACE:              test-python (v1:metadata.namespace)
      POD_OWNER_KIND:             deployment
      API_ENCRYPTION_KEY:         <set to the key 'encryptionKey' in secret 'kotsadm-encryption'>  Optional: false
      API_ENDPOINT:               http://kotsadm.test-python.svc.cluster.local:3000
      API_ADVERTISE_ENDPOINT:     http://localhost:8800
      S3_ENDPOINT:                http://kotsadm-minio:9000
      S3_BUCKET_NAME:             kotsadm
      S3_ACCESS_KEY_ID:           <set to the key 'accesskey' in secret 'kotsadm-minio'>  Optional: false
      S3_SECRET_ACCESS_KEY:       <set to the key 'secretkey' in secret 'kotsadm-minio'>  Optional: false
      S3_BUCKET_ENDPOINT:         true
      HTTP_PROXY:                 
      HTTPS_PROXY:                
      NO_PROXY:                   kotsadm-postgres,kotsadm-minio,kotsadm-api-node
      KOTS_INSTALL_ID:            xxxxxxxxxx
    Mounts:
      /backup from backup (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vvrx9 (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  migrations:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  backup:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-vvrx9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  3m9s  default-scheduler  0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

See as well the troubleshooting report (but it does not bring any new/helpful info):

$ cat /Users/camilamacedo/support-bundle-results.txt 
Check PASS
Title: Required Kubernetes Version
Message: Your cluster meets the recommended and required versions of Kubernetes

------------
Check PASS
Title: Container Runtime
Message: A supported container runtime is present on all nodes

------------
Check FAIL
Title: Pod test-python/kotsadm-c668dc485-f7vdv status
Message: Status: Pending

------------
Check FAIL
Title: test-python/kotsadm Deployment Status
Message: The deployment test-python/kotsadm has 0/1 replicas

------------
Check FAIL
Title: test-python/kotsadm-c668dc485 ReplicaSet Status
Message: The replicaset test-python/kotsadm-c668dc485 is not ready

------------
Check PASS
Title: Node status check
Message: All nodes are online.

------------
camilamacedo@Camilas-MacBook-Pro ~/tmp $ 

Hi @camilamacedo86, I believe the following could be the issue (from the describe node):

System Info:
  Machine ID:                 578f01a3d92a440cb41ce30c8209912c
  System UUID:                578f01a3d92a440cb41ce30c8209912c
  Boot ID:                    42edd258-91f0-4309-adac-b07363ee04fc
  Kernel Version:             5.15.49-linuxkit
  OS Image:                   Ubuntu 22.04.1 LTS
  Operating System:           linux
  Architecture:               arm64

The kotsadm deployment has a node affinity applied so that it will only be scheduled on linux os and will not be scheduled on arm64 architecture (see this spot in the code). If you would like, you can edit the deployment and remove this, but I cannot guarantee that things will work since we don't support this combination at this time.
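
For example (a sketch only, and again unsupported), the affinity could be inspected and removed with something like the following, using the test-python namespace from your outputs:

$ kubectl get deployment kotsadm -n test-python -o jsonpath='{.spec.template.spec.affinity}'
# remove the affinity entirely (unsupported experiment)
$ kubectl patch deployment kotsadm -n test-python --type=json \
    -p='[{"op": "remove", "path": "/spec/template/spec/affinity"}]'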

The support bundle command should have also generated a .tar.gz archive. This will contain lots of information about the cluster and its resources that will assist with troubleshooting.

I hope this is helpful!

Hi @cbodonnell,

Thank you for your help in understanding the issue. I tried to track the issues/RFEs/suggestions in a better way. Please feel free to check and let me know what you think and/or how I can help.

About supporting arm64

It seems that it will not work just by removing the affinity criteria, see: #896. It shows that for arm64 to be supported, the images used must also be built for arm64.

Therefore, I opened a new issue for it: (RFE) #3360

Also, I am proposing we add this info to the README for now, so that we can avoid others facing the same issue, see: https://github.com/replicatedhq/kots/pull/3362/files

The node affinity criteria also seem like they would not work on kind by default

I believe that after solving that, I might hit another issue as well. Note that when I changed kind to 3 nodes I faced the warning FailedScheduling 71s default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Then, I checked that kind nodes do not have the label node-role.kubernetes.io/master, which I understand would also make it fail. So, to see if we could also change that to allow kots to work on providers like kind, I created the issue: #3361
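
(For context, on a multi-node kind cluster the control-plane taint can be removed with something like the command below, though that alone would not address the affinity mismatch described above.)

$ kubectl taint node kind-control-plane node-role.kubernetes.io/control-plane-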

About the support bundle command that should have also generated a .tar.gz archive

I could find it, thank you. But, in this case, it does not seem to be much help either. I mean, by knowing that arm64 is not supported we can know the reason for the problem, but the bundle does not have a check like "validate cluster platform". I raised this as: replicatedhq/troubleshoot#805

Again, thank you very much for your time and attention.
Closing this one since it seems like we were able to raise proper issues for each scenario and project so they can be better addressed.