kubedl-io/kubedl

Problem installing with the yaml files [BUG]

Closed this issue · 7 comments

What happened:
I tried to install kubedl using the provided YAML installation, but the kubedl pod fails to start.
I followed the steps described for the YAML installation:

saeid@saeid-OptiPlex-7440-AIO:~/codes/kubedl$ kubectl apply -f https://raw.githubusercontent.com/kubedl-io/kubedl/master/config/manager/all_in_one.yaml
saeid@saeid-OptiPlex-7440-AIO:~/codes/kubedl$ kubectl get pods --all-namespaces
NAMESPACE              NAME                                         READY   STATUS             RESTARTS       AGE
kube-system            coredns-78fcd69978-7vr9g                     1/1     Running            1 (170m ago)   2d16h
kube-system            etcd-minikube                                1/1     Running            1 (170m ago)   2d16h
kube-system            kube-apiserver-minikube                      1/1     Running            1 (170m ago)   2d16h
kube-system            kube-controller-manager-minikube             1/1     Running            1 (170m ago)   2d16h
kube-system            kube-proxy-g55r9                             1/1     Running            1 (170m ago)   2d16h
kube-system            kube-scheduler-minikube                      1/1     Running            1 (170m ago)   2d16h
kube-system            metrics-server-77c99ccb96-9lqn8              1/1     Running            1 (170m ago)   2d16h
kube-system            storage-provisioner                          1/1     Running            3 (164m ago)   2d16h
kubedl-system          kubedl-74bd95588b-28q8n                      0/1     CrashLoopBackOff   8 (40s ago)    16m
kubernetes-dashboard   dashboard-metrics-scraper-5594458c94-pmcdr   1/1     Running            1 (170m ago)   3h55m
kubernetes-dashboard   kubernetes-dashboard-654cf69797-cbcbf        1/1     Running            2 (164m ago)   3h55m

Environment:

  • KubeDL version:
  • Kubernetes version (use kubectl version): v1.22.3
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.3 LTS
  • Kernel (e.g. uname -a):Ubuntu 20.04.3 LTS
  • Install tools: kubectl

@saeid93 Have you installed the CRD manifest files beforehand? Could you also provide the kubedl output logs? They would help us locate the problem.

@SimonCqk I think I have the CRDs installed correctly:

saeid@saeid-OptiPlex-7440-AIO:~/codes/kubedl$ kubectl get crd
NAME                               CREATED AT
crons.apps.kubedl.io               2021-11-17T12:33:48Z
elasticdljobs.training.kubedl.io   2021-11-17T12:33:48Z
inferences.serving.kubedl.io       2021-11-17T12:33:48Z
marsjobs.training.kubedl.io        2021-11-17T12:33:48Z
models.model.kubedl.io             2021-11-17T12:33:48Z
modelversions.model.kubedl.io      2021-11-17T12:33:48Z
mpijobs.training.kubedl.io         2021-11-17T12:33:49Z
pytorchjobs.training.kubedl.io     2021-11-17T12:33:51Z
tfjobs.training.kubedl.io          2021-11-17T12:33:52Z
xdljobs.training.kubedl.io         2021-11-17T12:33:52Z
xgboostjobs.training.kubedl.io     2021-11-17T12:33:52Z

Here is the output of describe and logs:

saeid@saeid-OptiPlex-7440-AIO:~/codes/kubedl$ kubectl -n kubedl-system describe pod kubedl-74bd95588b-28q8n
Name:         kubedl-74bd95588b-28q8n
Namespace:    kubedl-system
Priority:     0
Node:         minikube/192.168.49.2
Start Time:   Wed, 17 Nov 2021 07:34:03 -0500
Labels:       app=kubedl
              pod-template-hash=74bd95588b
Annotations:  <none>
Status:       Running
IP:           172.17.0.6
IPs:
  IP:           172.17.0.6
Controlled By:  ReplicaSet/kubedl-74bd95588b
Containers:
  kubedl-manager:
    Container ID:   docker://cf71cfe48ab6a6f4ae276e71012a9917b89ba2d36a10d85fe3caf6003201849e
    Image:          kubedl/kubedl:daily
    Image ID:       docker-pullable://kubedl/kubedl@sha256:a4e5651476c62bd51c986cc159b7fad619779e9bbf10fb58f3907cd36dfe7069
    Ports:          8080/TCP, 9876/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to write "204800": write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod10a127bd-3822-4dee-af98-ab62adee548d/cf71cfe48ab6a6f4ae276e71012a9917b89ba2d36a10d85fe3caf6003201849e/cpu.cfs_quota_us: invalid argument: unknown
      Exit Code:    128
      Started:      Wed, 17 Nov 2021 08:05:10 -0500
      Finished:     Wed, 17 Nov 2021 08:05:10 -0500
    Ready:          False
    Restart Count:  11
    Limits:
      cpu:     2048m
      memory:  2Gi
    Requests:
      cpu:        1024m
      memory:     1Gi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9xqrs (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-9xqrs:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  34m                    default-scheduler  Successfully assigned kubedl-system/kubedl-74bd95588b-28q8n to minikube
  Normal   Pulled     32m (x5 over 34m)      kubelet            Container image "kubedl/kubedl:daily" already present on machine
  Normal   Created    32m (x5 over 34m)      kubelet            Created container kubedl-manager
  Warning  Failed     32m (x5 over 34m)      kubelet            Error: failed to start container "kubedl-manager": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to write "204800": write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod10a127bd-3822-4dee-af98-ab62adee548d/kubedl-manager/cpu.cfs_quota_us: invalid argument: unknown
  Warning  BackOff    4m14s (x133 over 33m)  kubelet            Back-off restarting failed container

The logs seem to be empty:

saeid@saeid-OptiPlex-7440-AIO:~/codes/kubedl$ kubectl -n kubedl-system logs -p kubedl-74bd95588b-28q8n
saeid@saeid-OptiPlex-7440-AIO:~/codes/kubedl$ 

@saeid93 It seems that kubelet cannot set up the cgroup files correctly.
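For context, the value "204800" in the error is consistent with the pod's 2048m CPU limit. Assuming the default CFS period of 100000 microseconds, the quota written to cpu.cfs_quota_us is millicores × period / 1000. The sketch below only reproduces that arithmetic; the function name is hypothetical and not part of kubedl or kubectl.

```shell
# Convert a CPU limit in millicores to a CFS quota in microseconds.
# $1: CPU limit in millicores (e.g. 2048 for "2048m")
# $2: CFS period in microseconds (assumed default: 100000)
millicpu_to_quota_us() {
  echo $(( $1 * ${2:-100000} / 1000 ))
}

# The limit from the describe output above
millicpu_to_quota_us 2048
```

One possible cause, not confirmed in this thread: on cgroup v1 the kernel can reject such a write with "invalid argument" when the requested quota exceeds what the parent cgroup allows, for example when the node itself is restricted to fewer CPUs than the pod's limit.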

@SimonCqk Do you have any suggestions about that?

You could taint this node so that the kubedl pod is scheduled onto another node, or restart kubelet on the problematic node.
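The two suggestions above might look roughly like the following sketch. It runs in dry-run mode and only prints the commands. The node name and the `app=kubedl` label come from the describe output earlier in the thread; the taint key and value, the `run` helper, and the function names are hypothetical, and the kubelet restart assumes a systemd-managed node.

```shell
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 to actually run the commands.
DRY_RUN=1
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

NODE="minikube"                        # node name from the describe output
TAINT="kubedl-debug=true:NoSchedule"   # hypothetical taint key and value

# Option 1: keep new pods off the node, then delete the failing pod so
# its ReplicaSet recreates it on a node without the taint.
taint_and_reschedule() {
  run kubectl taint nodes "$NODE" "$TAINT"
  run kubectl -n kubedl-system delete pod -l app=kubedl
}

# Option 2: restart kubelet. Run this on the node itself.
restart_kubelet() {
  run sudo systemctl restart kubelet
}

taint_and_reschedule
restart_kubelet
```

On a single-node cluster such as the minikube setup shown here, option 1 would leave the pod Pending because there is no other node to schedule it on.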

Thanks @SimonCqk, yes, it was a problem with my cluster. I finally managed to install KubeDL.

@saeid93 you're welcome, feel free to ask us anything!