bentoml/Yatai

Failed to deploy Yatai to EKS

bobmayuze opened this issue · 5 comments

I followed the documentation and encountered several issues.

  1. Failed to deploy PostgreSQL
$ k events pods/postgresql-ha-postgresql-0 -n yatai-system
LAST SEEN              TYPE      REASON                 OBJECT                                                  MESSAGE
35m (x208 over 69m)    Warning   Unhealthy              Pod/yatai-6899664d9c-l2f6l                              Readiness probe failed: Get "http://172.31.89.159:7777/": dial tcp 172.31.89.159:7777: connect: connection refused
10m (x295 over 69m)    Warning   Unhealthy              Pod/yatai-6899664d9c-l2f6l                              Liveness probe failed: Get "http://172.31.89.159:7777/": dial tcp 172.31.89.159:7777: connect: connection refused
9m8s (x11 over 110m)   Warning   FailedScheduling       Pod/postgresql-ha-postgresql-1                          running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
9m8s (x11 over 110m)   Warning   FailedScheduling       Pod/postgresql-ha-postgresql-2                          running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
5m3s (x184 over 60m)   Warning   BackOff                Pod/yatai-6899664d9c-l2f6l                              Back-off restarting failed container
3m51s (x9 over 84m)    Warning   FailedScheduling       Pod/postgresql-ha-postgresql-0                          running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
26s (x482 over 120m)   Normal    ExternalProvisioning   PersistentVolumeClaim/data-postgresql-ha-postgresql-0   waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
26s (x482 over 120m)   Normal    ExternalProvisioning   PersistentVolumeClaim/data-postgresql-ha-postgresql-1   waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
26s (x483 over 120m)   Normal    ExternalProvisioning   PersistentVolumeClaim/data-postgresql-ha-postgresql-2   waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator

Then I switched to AWS RDS to keep the process going, yet the final stage of bringing up Yatai still failed.
2. Failed to bring up Yatai

$ k events pods/yatai-6899664d9c-l2f6l -n yatai-system
LAST SEEN               TYPE      REASON                 OBJECT                                                  MESSAGE
36m (x208 over 71m)     Warning   Unhealthy              Pod/yatai-6899664d9c-l2f6l                              Readiness probe failed: Get "http://172.31.89.159:7777/": dial tcp 172.31.89.159:7777: connect: connection refused
11m (x295 over 71m)     Warning   Unhealthy              Pod/yatai-6899664d9c-l2f6l                              Liveness probe failed: Get "http://172.31.89.159:7777/": dial tcp 172.31.89.159:7777: connect: connection refused
5m22s (x9 over 86m)     Warning   FailedScheduling       Pod/postgresql-ha-postgresql-0                          running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
117s (x482 over 121m)   Normal    ExternalProvisioning   PersistentVolumeClaim/data-postgresql-ha-postgresql-0   waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
117s (x482 over 121m)   Normal    ExternalProvisioning   PersistentVolumeClaim/data-postgresql-ha-postgresql-1   waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
117s (x483 over 121m)   Normal    ExternalProvisioning   PersistentVolumeClaim/data-postgresql-ha-postgresql-2   waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
86s (x196 over 62m)     Warning   BackOff                Pod/yatai-6899664d9c-l2f6l                              Back-off restarting failed container
28s (x12 over 111m)     Warning   FailedScheduling       Pod/postgresql-ha-postgresql-1                          running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
28s (x12 over 111m)     Warning   FailedScheduling       Pod/postgresql-ha-postgresql-2                          running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition

Any help on how I can keep this going? Thanks

First, you should check the PVC status:

kubectl -n yatai-system get pvc

If the PVCs are Pending or failed, you should describe the PVC to get the reason:

kubectl -n yatai-system describe pvc $pvcName

Maybe you don't have any StorageClass in your cluster, or you don't have a working StorageClass provisioner; see the check below.

refs: https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/
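As a quick check (standard kubectl commands; the CSI driver label in the second command is an assumption that depends on how the driver was installed):

# List storage classes; the default one is marked "(default)"
kubectl get storageclass

# On EKS, verify the EBS CSI driver pods are actually running
# (the label may differ depending on the install method)
kubectl -n kube-system get pods -l app.kubernetes.io/name=aws-ebs-csi-driver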

I tried to describe the PVCs and found this:

kubectl -n yatai-system get pvc
NAME                              STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-postgresql-ha-postgresql-0   Pending                                      gp2            13h
data-postgresql-ha-postgresql-1   Pending                                      gp2            13h
data-postgresql-ha-postgresql-2   Pending                                      gp2            13h

And to get a more detailed description, I ran k -n yatai-system describe pvc data-postgresql-ha-postgresql-0 and got this:

Name:          data-postgresql-ha-postgresql-0
Namespace:     yatai-system
StorageClass:  gp2
Status:        Pending
Volume:
Labels:        app.kubernetes.io/component=postgresql
               app.kubernetes.io/instance=postgresql-ha
               app.kubernetes.io/name=postgresql-ha
Annotations:   volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
               volume.kubernetes.io/selected-node: ip-172-31-21-121.ec2.internal
               volume.kubernetes.io/storage-provisioner: ebs.csi.aws.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       postgresql-ha-postgresql-0
Events:
  Type    Reason                Age                     From                         Message
  ----    ------                ----                    ----                         -------
  Normal  ExternalProvisioning  3m25s (x3203 over 13h)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator

So I described the storage class gp2 and got this:

$ k describe storageclass gp2
Name:            gp2
IsDefaultClass:  Yes
Annotations:     kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"},"name":"gp2"},"parameters":{"fsType":"ext4","type":"gp2"},"provisioner":"kubernetes.io/aws-ebs","volumeBindingMode":"WaitForFirstConsumer"}
,storageclass.kubernetes.io/is-default-class=true
Provisioner:           kubernetes.io/aws-ebs
Parameters:            fsType=ext4,type=gp2
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     WaitForFirstConsumer
Events:                <none>

Any guidance on creating the storage class here? This part is not mentioned in the installation doc.

Your PVC is annotated with volume.kubernetes.io/storage-provisioner: ebs.csi.aws.com, but that provisioner is not running in your cluster. You should follow the official AWS documentation to set up the EBS CSI driver on EKS and enable the OIDC IAM provider in your existing EKS cluster:

https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html
https://stackoverflow.com/a/68725742
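For reference, the eksctl flow from that AWS page looks roughly like this (a sketch, not the exact commands for your cluster: my-cluster, us-east-1, and 111122223333 are placeholders for your cluster name, region, and account ID):

# 1. Associate an OIDC identity provider with the existing cluster
eksctl utils associate-iam-oidc-provider --cluster my-cluster --region us-east-1 --approve

# 2. Create an IAM role for the EBS CSI controller's service account
eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster my-cluster \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve

# 3. Install the EBS CSI driver as an EKS managed add-on using that role
eksctl create addon \
  --name aws-ebs-csi-driver \
  --cluster my-cluster \
  --service-account-role-arn arn:aws:iam::111122223333:role/AmazonEKS_EBS_CSI_DriverRole \
  --force

On EKS 1.23 and later, in-tree kubernetes.io/aws-ebs volumes are migrated to this driver, so the existing default gp2 class should start provisioning once the driver is installed.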

It worked, but I had to re-install Yatai after the EBS CSI driver was configured.

@bobmayuze It is not necessary to reinstall; you only need to recreate the PVCs so they are recognized by the volume provisioner.
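Concretely, something like this should work (a sketch using the PVC and pod names from this thread):

# Delete the Pending PVCs; they were created before the provisioner existed
kubectl -n yatai-system delete pvc \
  data-postgresql-ha-postgresql-0 \
  data-postgresql-ha-postgresql-1 \
  data-postgresql-ha-postgresql-2

# Delete the StatefulSet pods so their volumeClaimTemplates recreate the PVCs;
# the PVC deletion only completes once the pods are gone (pvc-protection finalizer)
kubectl -n yatai-system delete pod \
  postgresql-ha-postgresql-0 \
  postgresql-ha-postgresql-1 \
  postgresql-ha-postgresql-2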