canonical/notebook-operators

Cannot start notebook after cluster restart.

Barteus opened this issue · 9 comments

Reproduce:

  1. Install CKF using "Quick start".
  2. Create a notebook.
  3. Restart the machine/evict the notebook from the node.
  4. Try to start the notebook. It always stays in the status "No Pod are currently running for this Notebook Server".

Expected:
Notebook starts

Environment:
OS - Ubuntu 20.04
microk8s - 1.22
Kubeflow - 1.6

This bug appears to be the same and has some good additional details.

I managed to reproduce the issue. It looks like the notebook server is not shut down correctly when the VM is stopped.
There is a workaround:

  • Before shutting down the VM, stop the notebook server(s).
  • After the VM restarts, start the notebook server(s) again.
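
The workaround above can also be scripted instead of using the UI. This is a minimal sketch, not something confirmed in this thread: it assumes the notebook controller treats the `kubeflow-resource-stopped` annotation on a Notebook resource as "stopped", and that the notebook lives in the `admin` namespace. The function names and the `KUBECTL` variable are made up for illustration.

```shell
# Assumption: the notebook controller scales a Notebook to zero while the
# "kubeflow-resource-stopped" annotation is present on it.
KUBECTL="${KUBECTL:-microk8s.kubectl}"

# stop_notebook NAMESPACE NAME: mark the notebook as stopped (run before VM shutdown).
stop_notebook() {
  $KUBECTL annotate notebook "$2" -n "$1" --overwrite \
    "kubeflow-resource-stopped=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# start_notebook NAMESPACE NAME: remove the annotation (run after VM restart).
start_notebook() {
  $KUBECTL annotate notebook "$2" -n "$1" kubeflow-resource-stopped-
}
```

For example, `stop_notebook admin new-test` before the shutdown and `start_notebook admin new-test` after the restart.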

More investigation.
When there are two or more notebook servers and just one of them is properly stopped (e.g. via the UI) before the VM restart, all notebook servers can be started and connected to after the VM is restarted.

Looking at canonical/bundle-kubeflow#515,
I see that the OS disk is 64GB, which should be enough for the Kubeflow deployment and the OS.
However, each additional volume adds to the total storage required. Two volumes of 10GB each are attached to the notebook server. I was wondering whether that could cause a problem when the notebook server pods start up after the VM is restarted.

Can we confirm what disk sizes were used when this issue occurred?
We need to make sure that the mount points for those notebook volumes are not eating into the storage that Kubeflow itself requires.
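
One way to gather the numbers asked for here is sketched below. It is a rough sketch only: the function names and the `KUBECTL` variable are hypothetical, and checking `/` assumes the notebook volumes are stored on the OS disk.

```shell
KUBECTL="${KUBECTL:-microk8s.kubectl}"

# Free space on the OS disk (assumed to also hold the notebook volumes).
show_root_disk_usage() {
  df -h /
}

# Every PersistentVolumeClaim with its capacity, across all namespaces.
list_volume_claims() {
  $KUBECTL get pvc --all-namespaces
}
```

Comparing the summed PVC capacities with the free space reported by `df` shows whether the notebook volumes leave enough room for the rest of the deployment.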

After letting the VM sit in a stopped state for a couple of days, I was able to reproduce the issue.
The notebook server named new-test was left running before the VM shutdown.
After the restart, all pods in the kubeflow and admin namespaces are in the Running state.
The notebook server pod new-test does not exist. (After earlier restarts it was usually present and in the Running state, and in those cases the notebook server was accessible.)
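
To narrow down where the missing pod gets lost, it can help to check each object in the chain. The sketch below assumes the notebook controller creates a StatefulSet named after the notebook and a pod named `<name>-0`, and that the notebook is in the `admin` namespace. The function name and the `KUBECTL` variable are hypothetical.

```shell
KUBECTL="${KUBECTL:-microk8s.kubectl}"

# inspect_notebook NAMESPACE NAME
# Shows the Notebook resource, the StatefulSet and pod assumed to back it,
# and recent events in the namespace (oldest first).
inspect_notebook() {
  $KUBECTL get notebook "$2" -n "$1"
  $KUBECTL get statefulset "$2" -n "$1"
  $KUBECTL get pod "$2-0" -n "$1"
  $KUBECTL get events -n "$1" --sort-by=.lastTimestamp
}
```

Running `inspect_notebook admin new-test` shows whether the Notebook resource and its StatefulSet survived the restart even though the pod is gone.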

This is the info gathered after the restart.

$ microk8s.kubectl -n kubeflow describe pod jupyter-controller-operator-0
Name:         jupyter-controller-operator-0
Namespace:    kubeflow
Priority:     0
Node:         kf-test/10.128.0.14
Start Time:   Mon, 28 Nov 2022 15:01:36 +0000
Labels:       controller-revision-hash=jupyter-controller-operator-68fc77d85d
              operator.juju.is/name=jupyter-controller
              operator.juju.is/target=application
              statefulset.kubernetes.io/pod-name=jupyter-controller-operator-0
Annotations:  apparmor.security.beta.kubernetes.io/pod: runtime/default
              cni.projectcalico.org/podIP: 10.1.211.219/32
              cni.projectcalico.org/podIPs: 10.1.211.219/32
              controller.juju.is/id: f957d721-f53d-41b6-8ef1-662083ae049e
              juju.is/version: 2.9.34
              model.juju.is/id: d0348bd4-17ab-4e48-84da-b30afeafdfc5
              seccomp.security.beta.kubernetes.io/pod: docker/default
Status:       Running
IP:           10.1.211.219
IPs:
  IP:           10.1.211.219
Controlled By:  StatefulSet/jupyter-controller-operator
Containers:
  juju-operator:
    Container ID:  containerd://749cd9a3b114b2cb9d0142e3d9e6eafdbd7685b6f8853d2dd36e51d34f5ab09e
    Image:         jujusolutions/jujud-operator:2.9.34
    Image ID:      docker.io/jujusolutions/jujud-operator@sha256:3b46568ca590857dfa053ea84eea457a3389de34dd8775f0b32bfb2c0a55f700
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      export JUJU_DATA_DIR=/var/lib/juju
      export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools
      
      mkdir -p $JUJU_TOOLS_DIR
      cp /opt/jujud $JUJU_TOOLS_DIR/jujud
      
      $JUJU_TOOLS_DIR/jujud caasoperator --application-name=jupyter-controller --debug
      
    State:          Running
      Started:      Wed, 30 Nov 2022 19:11:18 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 28 Nov 2022 15:01:39 +0000
      Finished:     Wed, 30 Nov 2022 19:06:02 +0000
    Ready:          True
    Restart Count:  1
    Environment:
      JUJU_APPLICATION:          jupyter-controller
      JUJU_OPERATOR_SERVICE_IP:  10.152.183.234
      JUJU_OPERATOR_POD_IP:       (v1:status.podIP)
      JUJU_OPERATOR_NAMESPACE:   kubeflow (v1:metadata.namespace)
    Mounts:
      /var/lib/juju/agents/application-jupyter-controller/operator.yaml from jupyter-controller-operator-config (rw,path="operator.yaml")
      /var/lib/juju/agents/application-jupyter-controller/template-agent.conf from jupyter-controller-operator-config (rw,path="template-agent.conf")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q4kkf (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  jupyter-controller-operator-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      jupyter-controller-operator-config
    Optional:  false
  kube-api-access-q4kkf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age                From     Message
  ----    ------          ----               ----     -------
  Normal  SandboxChanged  36m (x3 over 40m)  kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal  Pulled          35m                kubelet  Container image "jujusolutions/jujud-operator:2.9.34" already present on machine
  Normal  Created         35m                kubelet  Created container juju-operator
  Normal  Started         35m                kubelet  Started container juju-operator
$ microk8s.kubectl -n kubeflow describe pod jupyter-controller-5d4949ddd7-ng4fp
Name:         jupyter-controller-5d4949ddd7-ng4fp
Namespace:    kubeflow
Priority:     0
Node:         kf-test/10.128.0.14
Start Time:   Mon, 28 Nov 2022 15:02:28 +0000
Labels:       app.kubernetes.io/name=jupyter-controller
              pod-template-hash=5d4949ddd7
Annotations:  apparmor.security.beta.kubernetes.io/pod: runtime/default
              charm.juju.is/modified-version: 0
              cni.projectcalico.org/podIP: 10.1.212.40/32
              cni.projectcalico.org/podIPs: 10.1.212.40/32
              controller.juju.is/id: f957d721-f53d-41b6-8ef1-662083ae049e
              model.juju.is/id: d0348bd4-17ab-4e48-84da-b30afeafdfc5
              seccomp.security.beta.kubernetes.io/pod: docker/default
              unit.juju.is/id: jupyter-controller/0
Status:       Running
IP:           10.1.212.40
IPs:
  IP:           10.1.212.40
Controlled By:  ReplicaSet/jupyter-controller-5d4949ddd7
Init Containers:
  juju-pod-init:
    Container ID:  containerd://fcb5a989f05f8a18a5acb2715e0575aa58f90e9f49a04d19c8963f20ea36555b
    Image:         jujusolutions/jujud-operator:2.9.34
    Image ID:      docker.io/jujusolutions/jujud-operator@sha256:3b46568ca590857dfa053ea84eea457a3389de34dd8775f0b32bfb2c0a55f700
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      export JUJU_DATA_DIR=/var/lib/juju
      export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools
      
      mkdir -p $JUJU_TOOLS_DIR
      cp /opt/jujud $JUJU_TOOLS_DIR/jujud
      
      initCmd=$($JUJU_TOOLS_DIR/jujud help commands | grep caas-unit-init)
      if test -n "$initCmd"; then
      $JUJU_TOOLS_DIR/jujud caas-unit-init --debug --wait;
      else
      exit 0
      fi
      
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 30 Nov 2022 19:11:51 +0000
      Finished:     Wed, 30 Nov 2022 19:16:27 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dmpn2 (ro)
Containers:
  jupyter-controller:
    Container ID:  containerd://570b2722a9f13f7be7f05b0fa6c6db5c944672551c49f19ae9674dfeafbc0771
    Image:         registry.jujucharms.com/charm/kaq3thscd44n4eitar0ng5vn41qku3s076d4l/oci-image@sha256:8f4ec330927552d3bce6478f231c5415d01e019ad2b04b9c79565692dce360c1
    Image ID:      registry.jujucharms.com/charm/kaq3thscd44n4eitar0ng5vn41qku3s076d4l/oci-image@sha256:8f4ec330927552d3bce6478f231c5415d01e019ad2b04b9c79565692dce360c1
    Port:          <none>
    Host Port:     <none>
    Command:
      ./manager
    State:          Running
      Started:      Wed, 30 Nov 2022 19:16:34 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 28 Nov 2022 15:03:23 +0000
      Finished:     Wed, 30 Nov 2022 19:06:02 +0000
    Ready:          True
    Restart Count:  1
    Environment:
      ENABLE_CULLING:  true
      ISTIO_GATEWAY:   kubeflow/kubeflow-gateway
      USE_ISTIO:       true
    Mounts:
      /usr/bin/juju-run from juju-data-dir (rw,path="tools/jujud")
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dmpn2 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  juju-data-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-dmpn2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/arch=amd64
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                From     Message
  ----     ------                  ----               ----     -------
  Warning  FailedCreatePodSandBox  36m                kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "2fe236223fa8f0dae56d0bb130386e980e4c70c9e9d68796b70809330df9597a": Get "https://[10.152.183.1]:443/apis/crd.projectcalico.org/v1/ipamconfigs/default": context deadline exceeded
  Normal   SandboxChanged          36m (x4 over 41m)  kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled                  36m                kubelet  Container image "jujusolutions/jujud-operator:2.9.34" already present on machine
  Normal   Created                 36m                kubelet  Created container juju-pod-init
  Normal   Started                 36m                kubelet  Started container juju-pod-init
  Normal   Pulled                  31m                kubelet  Container image "registry.jujucharms.com/charm/kaq3thscd44n4eitar0ng5vn41qku3s076d4l/oci-image@sha256:8f4ec330927552d3bce6478f231c5415d01e019ad2b04b9c79565692dce360c1" already present on machine
  Normal   Created                 31m                kubelet  Created container jupyter-controller
  Normal   Started                 31m                kubelet  Started container jupyter-controller

Logs from Jupyter pods are attached.
jupyter-controller-pod.log
jupyter-controller-operator-pod.log

A newly created notebook server had access to the previous runs made in the notebook server that did not restart (new-test).

Leaving the VM in a stopped state overnight usually results in notebook servers that do not start up. Waiting for only 4-6 hours does not always result in the same behaviour.

Verified with Kubeflow 1.6 on K8s 1.24.
After restarting the VM with the MicroK8s cluster, all services came up and a connection to the notebook server could be made.
Details of the deployment:

microk8s status
dns
ha-cluster
hostpath-storage
ingress
metallb
storage

juju status
App                        Version                    Charm                    Channel         Rev
admission-webhook          res:oci-image@129fe92      admission-webhook        1.6/stable      60
argo-controller            res:oci-image@669ebd5      argo-controller          3.3/stable      99
argo-server                res:oci-image@576d038      argo-server              3.3/stable      45
dex-auth                                              dex-auth                 2.31/stable     129
istio-ingressgateway                                  istio-gateway            1.11/stable     114
istio-pilot                                           istio-pilot              1.11/stable     131
jupyter-controller         res:oci-image@e05857e      jupyter-controller       1.6/stable      163
jupyter-ui                 res:oci-image@d55c600      jupyter-ui               1.6/stable      124
katib-controller           res:oci-image@03d47fb      katib-controller         0.14/stable     92
katib-db                   mariadb/server:10.3        charmed-osm-mariadb-k8s  latest/stable   35
katib-db-manager           res:oci-image@16b33a5      katib-db-manager         0.14/stable     66
katib-ui                   res:oci-image@c7dc04a      katib-ui                 0.14/stable     90
kfp-api                    res:oci-image@bf747d5      kfp-api                  2.0/stable      144
kfp-db                     mariadb/server:10.3        charmed-osm-mariadb-k8s  latest/stable   35
kfp-persistence            res:oci-image@abcf971      kfp-persistence          2.0/stable      141
kfp-profile-controller     res:oci-image@b4de878      kfp-profile-controller   2.0/stable      125
kfp-schedwf                res:oci-image@9c9f710      kfp-schedwf              2.0/stable      155
kfp-ui                     res:oci-image@47864af      kfp-ui                   2.0/stable      144
kfp-viewer                 res:oci-image@94754c0      kfp-viewer               2.0/stable      152
kfp-viz                    res:oci-image@23ab9b9      kfp-viz                  2.0/stable      134
kubeflow-dashboard         res:oci-image@6fe6eec      kubeflow-dashboard       1.6/stable      183
kubeflow-profiles          res:profile-image@cfd6935  kubeflow-profiles        1.6/stable      94
kubeflow-roles                                        kubeflow-roles           1.6/stable      49
kubeflow-volumes           res:oci-image@fdb4a5d      kubeflow-volumes         1.6/stable      84
metacontroller-operator                               metacontroller-operator  2.0/stable      48
minio                      res:oci-image@1755999      minio                    ckf-1.6/stable  99
mlflow-db                  mariadb/server:10.3        charmed-osm-mariadb-k8s  stable          35
mlflow-server              res:oci-image@bba33cd      mlflow-server            stable          77
oidc-gatekeeper            res:oci-image@32de216      oidc-gatekeeper          ckf-1.6/stable  76
seldon-controller-manager  res:oci-image@eb811b6      seldon-core              1.14/stable     92
tensorboard-controller     res:oci-image@51058f7      tensorboard-controller   1.6/stable      69
tensorboards-web-app       res:oci-image@eef68a5      tensorboards-web-app     1.6/stable      71
training-operator                                     training-operator        1.5/stable      65

If the problem occurs again, a new issue will be opened. Closing.