Problem installing with the yaml files [BUG]
Closed this issue · 7 comments
saeid93 commented
What happened:
I cannot install KubeDL using the provided YAML installation: the manifests apply, but the kubedl pod fails to start.
I followed the steps described for the YAML installation:
saeid@saeid-OptiPlex-7440-AIO:~/codes/kubedl$ kubectl apply -f https://raw.githubusercontent.com/kubedl-io/kubedl/master/config/manager/all_in_one.yaml
saeid@saeid-OptiPlex-7440-AIO:~/codes/kubedl$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-78fcd69978-7vr9g 1/1 Running 1 (170m ago) 2d16h
kube-system etcd-minikube 1/1 Running 1 (170m ago) 2d16h
kube-system kube-apiserver-minikube 1/1 Running 1 (170m ago) 2d16h
kube-system kube-controller-manager-minikube 1/1 Running 1 (170m ago) 2d16h
kube-system kube-proxy-g55r9 1/1 Running 1 (170m ago) 2d16h
kube-system kube-scheduler-minikube 1/1 Running 1 (170m ago) 2d16h
kube-system metrics-server-77c99ccb96-9lqn8 1/1 Running 1 (170m ago) 2d16h
kube-system storage-provisioner 1/1 Running 3 (164m ago) 2d16h
kubedl-system kubedl-74bd95588b-28q8n 0/1 CrashLoopBackOff 8 (40s ago) 16m
kubernetes-dashboard dashboard-metrics-scraper-5594458c94-pmcdr 1/1 Running 1 (170m ago) 3h55m
kubernetes-dashboard kubernetes-dashboard-654cf69797-cbcbf 1/1 Running 2 (164m ago) 3h55m
Environment:
- KubeDL version:
- Kubernetes version (use kubectl version): v1.22.3
- OS (e.g: cat /etc/os-release): Ubuntu 20.04.3 LTS
- Kernel (e.g. uname -a): Ubuntu 20.04.3 LTS
- Install tools: kubectl
SimonCqk commented
@saeid93 Have you installed the CRD manifest files beforehand? Could you also provide the kubedl logs?
That would help us locate the problem.
saeid93 commented
@SimonCqk I think I have the CRDs installed correctly:
saeid@saeid-OptiPlex-7440-AIO:~/codes/kubedl$ kubectl get crd
NAME CREATED AT
crons.apps.kubedl.io 2021-11-17T12:33:48Z
elasticdljobs.training.kubedl.io 2021-11-17T12:33:48Z
inferences.serving.kubedl.io 2021-11-17T12:33:48Z
marsjobs.training.kubedl.io 2021-11-17T12:33:48Z
models.model.kubedl.io 2021-11-17T12:33:48Z
modelversions.model.kubedl.io 2021-11-17T12:33:48Z
mpijobs.training.kubedl.io 2021-11-17T12:33:49Z
pytorchjobs.training.kubedl.io 2021-11-17T12:33:51Z
tfjobs.training.kubedl.io 2021-11-17T12:33:52Z
xdljobs.training.kubedl.io 2021-11-17T12:33:52Z
xgboostjobs.training.kubedl.io 2021-11-17T12:33:52Z
Here is the output of describe and logs:
saeid@saeid-OptiPlex-7440-AIO:~/codes/kubedl$ kubectl -n kubedl-system describe pod kubedl-74bd95588b-28q8n
Name: kubedl-74bd95588b-28q8n
Namespace: kubedl-system
Priority: 0
Node: minikube/192.168.49.2
Start Time: Wed, 17 Nov 2021 07:34:03 -0500
Labels: app=kubedl
pod-template-hash=74bd95588b
Annotations: <none>
Status: Running
IP: 172.17.0.6
IPs:
IP: 172.17.0.6
Controlled By: ReplicaSet/kubedl-74bd95588b
Containers:
kubedl-manager:
Container ID: docker://cf71cfe48ab6a6f4ae276e71012a9917b89ba2d36a10d85fe3caf6003201849e
Image: kubedl/kubedl:daily
Image ID: docker-pullable://kubedl/kubedl@sha256:a4e5651476c62bd51c986cc159b7fad619779e9bbf10fb58f3907cd36dfe7069
Ports: 8080/TCP, 9876/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to write "204800": write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod10a127bd-3822-4dee-af98-ab62adee548d/cf71cfe48ab6a6f4ae276e71012a9917b89ba2d36a10d85fe3caf6003201849e/cpu.cfs_quota_us: invalid argument: unknown
Exit Code: 128
Started: Wed, 17 Nov 2021 08:05:10 -0500
Finished: Wed, 17 Nov 2021 08:05:10 -0500
Ready: False
Restart Count: 11
Limits:
cpu: 2048m
memory: 2Gi
Requests:
cpu: 1024m
memory: 1Gi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9xqrs (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-9xqrs:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 34m default-scheduler Successfully assigned kubedl-system/kubedl-74bd95588b-28q8n to minikube
Normal Pulled 32m (x5 over 34m) kubelet Container image "kubedl/kubedl:daily" already present on machine
Normal Created 32m (x5 over 34m) kubelet Created container kubedl-manager
Warning Failed 32m (x5 over 34m) kubelet Error: failed to start container "kubedl-manager": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to write "204800": write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod10a127bd-3822-4dee-af98-ab62adee548d/kubedl-manager/cpu.cfs_quota_us: invalid argument: unknown
Warning BackOff 4m14s (x133 over 33m) kubelet Back-off restarting failed container
The logs appear to be empty:
saeid@saeid-OptiPlex-7440-AIO:~/codes/kubedl$ kubectl -n kubedl-system logs -p kubedl-74bd95588b-28q8n
saeid@saeid-OptiPlex-7440-AIO:~/codes/kubedl$
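A note on the "204800" in the error above: it matches the container's CPU limit of 2048m converted to a CFS quota, assuming the kubelet's default CFS period of 100000 microseconds. The sketch below only reproduces that arithmetic. The function names and the 2-CPU parent quota are hypothetical illustrations and are not taken from this thread; whether a parent cgroup cap is what actually rejected the write here has not been confirmed.

```python
# Assumed default CFS period (cpu.cfs_period_us) in microseconds.
CFS_PERIOD_US = 100_000


def millicores_to_cfs_quota(millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Convert a CPU limit in millicores to a CFS quota in microseconds."""
    return millicores * period_us // 1000


def exceeds_parent(child_quota_us: int, parent_quota_us: int) -> bool:
    """Return True if the child quota is larger than the parent cgroup's quota."""
    return child_quota_us > parent_quota_us


# The pod's limit from the describe output: cpu: 2048m
pod_quota = millicores_to_cfs_quota(2048)
print(pod_quota)  # 204800

# Hypothetical: a node container capped at 2 CPUs would have a quota of 200000.
hypothetical_parent_quota = millicores_to_cfs_quota(2000)
print(exceeds_parent(pod_quota, hypothetical_parent_quota))  # True
```

If that assumption holds for this minikube node, a limit above the node's CPU cap is one possible reason for the "invalid argument" write failure, since on cgroup v1 a child's CPU quota may not exceed its parent's.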
SimonCqk commented
@SimonCqk Do you have any suggestions about that?
Could you taint this node and schedule the kubedl pod onto another node? Alternatively, restart the kubelet on the problematic node.