automaticserver/lxe

pod gets immediately killed after it starts running


I have lxe, lxc and lxd installed on my system. I have followed the instructions here, but when I ran the command it gave me errors about failing to pull images like the kube api server and the other components.

ERROR  [10-17|14:51:41.838] PullImage: failed to pull image               error="The remote \"registry.k8s.io\" doesn't exist" image="registry.k8s.io:kube-apiserver%v1.28.2"
ERROR  [10-17|14:51:41.854] PullImage: failed to pull image               error="The remote \"registry.k8s.io\" doesn't exist" image="registry.k8s.io:kube-apiserver%v1.28.2"
ERROR  [10-17|14:51:41.872] PullImage: failed to pull image               error="The remote \"registry.k8s.io\" doesn't exist" image="registry.k8s.io:kube-apiserver%v1.28.2"
ERROR  [10-17|14:51:41.888] PullImage: failed to pull image               error="The remote \"registry.k8s.io\" doesn't exist" image="registry.k8s.io:kube-apiserver%v1.28.2"
ERROR  [10-17|14:51:41.903] PullImage: failed to pull image               error="The remote \"registry.k8s.io\" doesn't exist" image="registry.k8s.io:kube-apiserver%v1.28.2"
ERROR  [10-17|14:51:41.936] PullImage: failed to pull image               error="The remote \"registry.k8s.io\" doesn't exist" image="registry.k8s.io:kube-controller-manager%v1.28.2"
ERROR  [10-17|14:51:41.952] PullImage: failed to pull image               error="The remote \"registry.k8s.io\" doesn't exist" image="registry.k8s.io:kube-controller-manager%v1.28.2"

Am I missing something?

Yes, LXE is a wrapper around LXD. The key conceptual background is that LXD only does system containers, which means LXD (and thus also LXE) can't handle docker images.

Your easiest way to start is keeping your master node(s) on docker (or lxcri, which is a wrapper around LXC) and only using LXE on nodes where you want it explicitly. There are no LXD container images available for any of the kube binaries, so you would have to build them yourself, which takes a lot of effort.

This concept of having a master node on docker and the additional node(s) with LXE is used in this example with k3s. Tip: Don't forget to taint the additional nodes accordingly so that pods for docker and pods for LXD don't get mixed (also described in the example).
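For example, something roughly like this from the master - the taint key/value here are just placeholders and not necessarily what the example uses, so adapt them to your setup:

root@node1:~# kubectl taint nodes node2 runtime=lxe:NoSchedule

Pods meant for the LXE node then need a matching toleration in their spec.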

You probably can make this work with kubeadm as well. You'd start the cluster with "normal" docker nodes and then extend the cluster with LXE nodes.
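A rough sketch of what joining an LXE node with kubeadm could look like - the control-plane host, token and hash are placeholders, only the CRI socket path is the one used further below:

root@node2:~# kubeadm join <control-plane-host>:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --cri-socket unix:///run/lxe.sock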

Thanks a lot. I have followed the example and have a couple of issues now.
These are my nodes on lxc --vm. I have followed the example until the end.
nodek2 Ready <none> 3h14m v1.27.6+k3s1 10.113.196.20 <none> Ubuntu 22.04.3 LTS 5.15.0-1044-kvm lxe://0.0.0

nodek1 Ready control-plane,master 3h31m v1.27.6+k3s1 10.113.196.8 <none> Ubuntu 22.04.3 LTS 5.15.0-1044-kvm containerd://1.7.6-k3s1.27

I have tried to create the simple pod mentioned in the example but got various issues.
It's not cleaning up its own LXD containers when I delete the pod, saying the container still has a /boot/ directory.
Also some snap-related issues, I guess. I will add /var/log/syslog from node2, which is on the lxe sock, from the time I create the pod until destroying it. The pod is in a constant loop of creating new containers.

P.S. Actually I am trying to use my local LXC images with k8s. I couldn't do it directly, and upon searching I learned that I have to use lxe, because containerd is not okay with LXC containers. I have tried lxe with kubernetes, but it can't download the scheduler, API server and the other master node components.
The example works well for my use case, but let me know if you have any other insights for it.

tail -f /var/log/syslog

kubectlogs.txt

I see. I've replicated the example and got

Oct 18 18:17:14 node2 k3s[10566]: Error: failed to parse kubelet flag: unknown flag: --container-runtime

This leads to node2's kubelet not starting up.

There have been changes to the kubelet arguments in the current kubernetes version. I'll update the example in a bit. For now this seems to get the node up and running:

root@node2:~# cat <<EOF >/etc/rancher/k3s/config.yaml
kubelet-arg:
  - "container-runtime-endpoint=unix:///run/lxe.sock"
EOF
root@node2:~# systemctl restart k3s-agent

It now registers correctly and is Ready:

root@node1:~# kubectl get node node2 -o wide
NAME    STATUS   ROLES    AGE   VERSION        INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
node2   Ready    <none>   51m   v1.27.6+k3s1   10.166.4.118   <none>        Ubuntu 22.04.3 LTS   5.15.0-1044-kvm   lxe://0.0.0

Running into a loop where the pod gets created on the CRI and then deleted again. I'll continue to investigate.

Thanks. Did you check the logs? It's probably related to storage, or the CRI can't access the storage - I am not really sure, but I'm waiting for your response.
Also, yes, you don't need --container-runtime after v1.24 in your Kubernetes config file.

So I see multiple issues:

  • The pod/container gets instantly stopped after it is running (there's an event with reason Killing for the pod/ubuntu). I haven't yet figured out the real reason for this and dug deep into the kubelet logs, as this "decision" must've been made by kubelet beforehand, but I wasn't able to pinpoint an exact cause yet. Additionally there is this long-standing issue that it is always difficult to identify the reason for killing a pod, which makes debugging these things hard.
  • I have seen the mentioned failed to remove container "xyz" log ("boot": remove boot: directory not empty), but so far it was in conjunction with the above issue. It may be too early for lxd to delete it, but I can't say for sure. Deleting the pod with crictl was fine (which happens naturally a bit afterwards).

For whatever reason, the exact k3s version used at the time of writing the example works perfectly fine (and earlier versions, as far as I remember). And lxd version 5.0.2, the current stable LTS release, seems to be fine.

So to play around with lxe - until we have resolved the issues - use

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.23.6+k3s1" sh -

on node1 and

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.23.6+k3s1" K3S_URL=https://node1.lxd:6443 K3S_TOKEN=... sh -

on node2. E.g. the stopping issue mentioned above already happens with v1.23.17+k3s1. There must've been a change in handling/interpreting the pod/container CRI state.

To actually address your use case:

I have tried lxe with kubernetes, but it can't download the scheduler, API server and the other master node components.

That is correct, as LXD can't work with docker/OCI containers. So instead of building those all by hand, the example shows that we can keep the master node(s) on docker and use separate nodes for LXD containers. You can use any orchestrator or manual setup as long as you can split the k8s nodes by their purpose.

Actually I am trying to use my local LXC images with k8s

This is a perfect use case: You can set up your own lxd remote hosting your LXD image. Unfortunately we can't "fake" a localhost remote, as lxd doesn't allow copying images on the same host, which is what the CRI ImagePull request would task LXE to do. There is a way we could detect and code a workaround for that, but for now you'd be stuck with an actual lxd remote.

The image must have an alias, as this is how LXE tries to find it. In the example I used image: ubuntu/jammy in the podspec. As described in the image name format, this basically boils down to image: <remotename>/<aliasname>.

Where <remotename> is one of your remotes:

root@node2:~# lxc remote list
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
|      NAME       |                   URL                    |   PROTOCOL    |  AUTH TYPE  | PUBLIC | STATIC | GLOBAL |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| images          | https://images.linuxcontainers.org       | simplestreams | none        | YES    | NO     | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| local (current) | unix://                                  | lxd           | file access | NO     | YES    | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| ubuntu          | https://cloud-images.ubuntu.com/releases | simplestreams | none        | YES    | YES    | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| ubuntu-daily    | https://cloud-images.ubuntu.com/daily    | simplestreams | none        | YES    | YES    | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+

And <aliasname> is, as the name suggests, the alias name of the image on that remote:

root@node2:~# lxc image alias list ubuntu:jammy
+---------------+--------------+-----------------+-------------+
|     ALIAS     | FINGERPRINT  |      TYPE       | DESCRIPTION |
+---------------+--------------+-----------------+-------------+
| jammy         | b948dd91cd5a | CONTAINER       |             |
+---------------+--------------+-----------------+-------------+
| jammy         | cdb406abc085 | VIRTUAL-MACHINE |             |
+---------------+--------------+-----------------+-------------+
[...]

Be aware that we can't use the : as a separator between <remotename> and <aliasname> in the podspec as you would with lxc commands. But LXE knows that the first part before the first / is the remote name and treats it accordingly. <aliasname> itself can contain / just fine. That's why image: <remotename>/<aliasname> is the way to go.

root@node2:~# lxc remote add myremote example.com
Certificate fingerprint: c84c80112f4a339dc7debfe69e2d0fdd9f9a8f9942ec444e1a90361aa440967a
ok (y/n/[fingerprint])? y
Admin password (or token) for myremote: 
Client certificate now trusted by server: myremote

root@node2:~# lxc remote list
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
|      NAME       |                   URL                    |   PROTOCOL    |  AUTH TYPE  | PUBLIC | STATIC | GLOBAL |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| images          | https://images.linuxcontainers.org       | simplestreams | none        | YES    | NO     | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| local (current) | unix://                                  | lxd           | file access | NO     | YES    | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| myremote        | https://example.com:8443                 | lxd           | tls         | NO     | NO     | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| ubuntu          | https://cloud-images.ubuntu.com/releases | simplestreams | none        | YES    | YES    | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+
| ubuntu-daily    | https://cloud-images.ubuntu.com/daily    | simplestreams | none        | YES    | YES    | NO     |
+-----------------+------------------------------------------+---------------+-------------+--------+--------+--------+

root@node2:~# lxc image copy ubuntu:jammy myremote: --alias myownjammy
Image copied successfully!

root@node2:~# lxc image alias list myremote:
+------------+--------------+-----------+-------------+
|   ALIAS    | FINGERPRINT  |   TYPE    | DESCRIPTION |
+------------+--------------+-----------+-------------+
| myownjammy | b948dd91cd5a | CONTAINER |             |
+------------+--------------+-----------+-------------+

Make sure to restart lxe, and thus also kubelet/k3s-agent, after the remote config change, as the remote config is only read on startup.
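For example, assuming lxe runs as a systemd unit named lxe as set up in the example (adjust to however you start it on your node):

root@node2:~# systemctl restart lxe
root@node2:~# systemctl restart k3s-agent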

You can now write a pod with image: myremote/myownjammy - as long as you make sure the pod lands on your node with this remote set up.
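A minimal sketch of such a pod, applied from the master - the pod/container name, the node hostname and the toleration key/value are placeholders (the toleration matches the placeholder taint from earlier), so adjust them to your setup:

root@node1:~# kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: myownjammy
spec:
  # pin the pod to the node that has the myremote remote configured
  nodeSelector:
    kubernetes.io/hostname: node2
  # tolerate the taint that keeps docker pods off the LXE node
  tolerations:
    - key: "runtime"
      operator: "Equal"
      value: "lxe"
      effect: "NoSchedule"
  containers:
    - name: myownjammy
      image: myremote/myownjammy
EOF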

Thanks a lot, I will test this in my test env, with one master on containerd and the worker on lxe.

Remember I'm still having a look at why pods are getting immediately killed. For now, k8s v1.23.6 is a known working version to play around with.

Have you worked on a k3s single node cluster? I don't know where else to create the issue or find you. I have restarted my server and now coredns and other pods are in CrashLoopBackOff. I would love to provide details, but won't here.

I know exactly what you mean, and no logs are needed. The main issue is that there are no k8s service images available for lxd; nobody has created them yet.

Running a single node cluster (or put differently: any master components) in LXE requires a deep understanding of kubernetes and extra manual work: you'd install and configure kube-proxy, kube-apiserver, kube-controller-manager, coredns, etc. explicitly on the appropriate host or specific container by hand. Please refer to this guide on how to set up kubernetes the hard way. If you want an easy way, you need a cluster with a separate master on docker.

If you have crashloop errors on the node with lxe, please use kubernetes v1.23.6, as this is currently a known working version until the issue is resolved.
If you have crashloop errors on a node with docker/containerd/cri-o, then there's no lxe involved and thus I can't help you much - it has to be debugged accordingly.