Gravity compatibility
eddiewang opened this issue · 25 comments
Gravity is a platform that allows us to build K8s clusters declaratively, and is a pretty powerful tool I've started experimenting with as part of my devops toolkit.
It has its own implementation of WireGuard (wormhole) that helps create a mesh, similar to Kilo, but Kilo provides easy peering functionality with kgctl.
I'd love to start a conversation about how we can make a .yaml deployment for Gravity clusters. I'm able to get Kilo up and running on Gravity pretty seamlessly. The only issue right now is that although the Kilo WireGuard interface shows up, it appears that Kilo/kgctl is never able to pull the nodes and properly apply the WireGuard config.
Ah yes cool idea! I've never tried gravity myself but it should certainly be possible to make this work. To get started, can you share the logs from the Kilo pods?
Here's what I'm getting from each pod, more or less. This is in flannel compatibility mode (no wormhole installed).
{"caller":"main.go:217","msg":"Starting Kilo network mesh 'dc8fb2dd466667c1efbf5b56e0d1b6bac34858e4'.","ts":"2020-07-01T05:26:29.99229865Z"}
{"caller":"mesh.go:447","component":"kilo","event":"add","level":"info","peer":{"AllowedIPs":[{"IP":"10.79.0.1","Mask":"/////w=="}],"Endpoint":null,"PersistentKeepalive":25,"PresharedKey":null,"PublicKey":"R2lFazE5WGpycEY3U3d1a25sbEcvbCthdTh5YkcrWXZMdWhCMnFjMkF5WT0=","Name":"athena"},"ts":"2020-07-01T05:26:30.235760912Z"}
E0701 14:30:28.507753 1 reflector.go:270] pkg/k8s/backend.go:396: Failed to watch *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?resourceVersion=227644&timeout=9m38s&timeoutSeconds=578&watch=true": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:28.508818 1 reflector.go:270] pkg/k8s/backend.go:147: Failed to watch *v1.Node: Get "https://100.100.0.1/api/v1/nodes?resourceVersion=490913&timeout=7m51s&timeoutSeconds=471&watch=true": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:29.700792 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:29.736711 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:30.701588 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:30.738420 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:31.703045 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:31.740499 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:32.704577 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:32.744346 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:33.706824 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:33.745384 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:34.708073 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:34.747652 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:35.709103 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:35.748564 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:36.709919 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:36.749271 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:37.711122 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:37.750078 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:38.712056 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:38.751328 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:39.713188 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:39.752676 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:40.714502 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:40.754304 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:41.715771 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:41.755116 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:42.716737 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:42.756076 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:43.717600 1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:43.757441 1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
Gravity runs in its default CIDRs at these subnets:
(Optional) CIDR range Kubernetes will be allocating service IPs from. Defaults to 10.100.0.0/16.
(Optional) CIDR range Kubernetes will be allocating node subnets and pod IPs from. Must be a minimum of /16 so Kubernetes is able to allocate /24 to each node. Defaults to 10.244.0.0/16.
Source: https://gravitational.com/gravity/docs/installation/
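As a quick cross-check of the quoted defaults against the addresses that actually appear in this thread, here is a minimal Python sketch using only the standard ipaddress module. The CIDRs are the documented defaults quoted above, and the addresses are copied from the Kilo logs and the ip a output posted in this issue. The helper name matching_ranges is made up for illustration.

```python
import ipaddress

# Documented Gravity defaults, as quoted from the installation docs.
DOCUMENTED = {
    "service": ipaddress.ip_network("10.100.0.0/16"),
    "pod": ipaddress.ip_network("10.244.0.0/16"),
}

# Addresses observed in this thread.
OBSERVED = {
    "API server (from the Kilo logs)": "100.100.0.1",
    "cni0 bridge (from ip a)": "100.96.36.1",
}


def matching_ranges(addr):
    """Return the names of the documented ranges that contain addr."""
    ip = ipaddress.ip_address(addr)
    return [name for name, net in DOCUMENTED.items() if ip in net]


for label, addr in OBSERVED.items():
    hits = matching_ranges(addr)
    print(f"{label}: {addr} -> {hits or 'outside both documented defaults'}")
```

Neither observed address falls inside the documented defaults, so this cluster appears to be using different ranges than the ones quoted. That is only an observation, not a diagnosis of the connection errors.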
root@machine:~# wg
interface: kilo0
Running wg shows the interface being created, but no peer settings applied. Running kgctl to get the peer config returns:
Error: did not find any valid Kilo nodes in the cluster
[...]
did not find any valid Kilo nodes in the cluster
Interestingly, I don't see a Kilo conf file generated anywhere. And on the host machine itself, I don't even see a key file. Instead, I have to run gravity shell, which takes me inside the "containerized" Kubernetes, where I see a path for /var/lib/kilo, which only contains a key file.
Quick update on this. I got the Failed to list... message to go away. Now I'm stuck here:
❯ k logs -f kilo-sdbzj -n kube-system
{"caller":"mesh.go:220","component":"kilo","level":"warn","msg":"no private key found on disk; generating one now","ts":"2020-07-02T14:57:02.37269041Z"}
{"caller":"main.go:217","msg":"Starting Kilo network mesh '3948f5e97a90a32766b03aaae2a495a3bc1d5263'.","ts":"2020-07-02T14:57:02.397981862Z"}
^C
❯ k logs -f kilo-zjhrs -n kube-system
{"caller":"mesh.go:220","component":"kilo","level":"warn","msg":"no private key found on disk; generating one now","ts":"2020-07-02T14:57:02.993172913Z"}
{"caller":"main.go:217","msg":"Starting Kilo network mesh '3948f5e97a90a32766b03aaae2a495a3bc1d5263'.","ts":"2020-07-02T14:57:03.011767615Z"}
^C
I properly mounted the /var/lib/kilo path on top of the Gravity cluster so it now appears on the host as well. However, I still do not see a config file being generated. I only see a key file.
OK, that sounds like great progress so far! What did you have to do to get the API access to work? Was it about using the host networking namespace?
As far as the WG config file goes, Kilo only generates that file for the leader of the location. In a one-node cluster, this is obvious :) Otherwise, you can force the leader to be a given node with the kilo.squat.ai/leader annotation and then check the Pod on that specific node.
I believe it might have been the wormhole CNI or a bad config from when I was playing around with the cluster. A clean Gravity install with flannel doesn't seem to cause any issues.
You'll notice I ssh'd into each node of the cluster and checked the kilo folder. No config in any of those. Let me try forcing a leader and see if a config gets generated.
UPDATE: tried setting the leader and recreated the Kilo pods, no dice. No config shows, and the pods have the same logs as above.
OK, interesting. In this case perhaps none of the nodes are actually "ready", e.g. none has all of the needed annotations. Can you share the output of kubectl get node -o yaml for the node labeled as a leader?
Sure! Here it is:
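To make "has all of the needed annotations" concrete, here is a rough Python sketch that inspects the JSON form of a node (for example from kubectl get node -o json) and reports which Kilo annotations are absent. The annotation list is an assumption taken from the kilo.squat.ai/* annotations that show up on a node later in this thread, not from Kilo's source, and the function name missing_kilo_annotations is hypothetical.

```python
import json

# Assumed set of annotations Kilo writes on a node it has discovered.
# Taken from annotations shown elsewhere in this thread, not from Kilo's source.
EXPECTED = (
    "kilo.squat.ai/endpoint",
    "kilo.squat.ai/internal-ip",
    "kilo.squat.ai/key",
    "kilo.squat.ai/last-seen",
)


def missing_kilo_annotations(node):
    """Return the expected annotations that are absent or empty on the node."""
    annotations = node.get("metadata", {}).get("annotations") or {}
    return [key for key in EXPECTED if not annotations.get(key)]


# Trimmed-down example node carrying only a manually set leader annotation.
example = json.loads("""
{
  "metadata": {
    "name": "example-node",
    "annotations": {"kilo.squat.ai/leader": "true"}
  }
}
""")

print(missing_kilo_annotations(example))
```

Against a live cluster, the input could instead be read with json.load(sys.stdin) from the piped kubectl output. A non-empty result would suggest that Kilo has not yet written its own details onto that node.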
apiVersion: v1
kind: Node
metadata:
  annotations:
    kilo.squat.ai/leader: "true"
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2020-07-02T05:13:56Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    gravitational.io/advertise-ip: 144.91.83.116
    gravitational.io/k8s-role: master
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: 144.91.83.116
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: master
    role: master
  name: 144.91.83.116
  resourceVersion: "189547"
  selfLink: /api/v1/nodes/144.91.83.116
  uid: b9ec8ac8-f131-474c-b3f9-114cad21a81c
spec: {}
status:
  addresses:
  - address: 144.91.83.116
    type: InternalIP
  - address: 144.91.83.116
    type: Hostname
  allocatable:
    cpu: 3400m
    ephemeral-storage: "1403705377716"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 20553484Ki
    pods: "110"
  capacity:
    cpu: "6"
    ephemeral-storage: 1442953720Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 20553484Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2020-07-02T16:05:17Z"
    lastTransitionTime: "2020-07-02T05:12:59Z"
    message: kernel has no deadlock
    reason: KernelHasNoDeadlock
    status: "False"
    type: KernelDeadlock
  - lastHeartbeatTime: "2020-07-02T16:05:17Z"
    lastTransitionTime: "2020-07-02T05:12:59Z"
    message: filesystem is not read-only
    reason: FilesystemIsNotReadOnly
    status: "False"
    type: ReadonlyFilesystem
  - lastHeartbeatTime: "2020-07-02T16:05:17Z"
    lastTransitionTime: "2020-07-02T05:12:59Z"
    reason: CorruptDockerOverlay2
    status: "False"
    type: CorruptDockerOverlay2
  - lastHeartbeatTime: "2020-07-02T16:05:17Z"
    lastTransitionTime: "2020-07-02T05:13:01Z"
    reason: UnregisterNetDevice
    status: "False"
    type: FrequentUnregisterNetDevice
  - lastHeartbeatTime: "2020-07-02T16:05:17Z"
    lastTransitionTime: "2020-07-02T05:13:00Z"
    reason: FrequentKubeletRestart
    status: "False"
    type: FrequentKubeletRestart
  - lastHeartbeatTime: "2020-07-02T16:05:17Z"
    lastTransitionTime: "2020-07-02T05:13:01Z"
    reason: FrequentDockerRestart
    status: "False"
    type: FrequentDockerRestart
  - lastHeartbeatTime: "2020-07-02T16:02:09Z"
    lastTransitionTime: "2020-07-02T05:13:56Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2020-07-02T16:02:09Z"
    lastTransitionTime: "2020-07-02T05:13:56Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2020-07-02T16:02:09Z"
    lastTransitionTime: "2020-07-02T05:13:56Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2020-07-02T16:02:09Z"
    lastTransitionTime: "2020-07-02T05:13:57Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - leader.telekube.local:5000/openebs/node-disk-manager-amd64@sha256:6edab3e0bbc09f8fd8100ee6da6b77aa6cf10e5771efc4dbf27b289a86b06fd7
    - leader.telekube.local:5000/openebs/node-disk-manager-amd64:v0.4.7
    sizeBytes: 165782342
  - names:
    - leader.telekube.local:5000/openebs/node-disk-operator-amd64@sha256:529ca9f80bcf102f97baf3b86a865e9e9de3c6b7abdfe1dd8258da32abc39181
    - leader.telekube.local:5000/openebs/node-disk-operator-amd64:v0.4.7
    sizeBytes: 165533134
  - names:
    - leader.telekube.local:5000/gravity-site@sha256:533d4700db15abf210c2f45cd392b2a10744dacb7f5fe28851eaa14ade5dddd7
    - leader.telekube.local:5000/gravity-site:7.0.11
    sizeBytes: 121344564
  - names:
    - leader.telekube.local:5000/logrange/collector@sha256:8d852b4dd7d8ded971f408d531da6e0859358d88a0db089886f7bf645ede4e22
    - leader.telekube.local:5000/logrange/collector:v0.1.43
    sizeBytes: 110511564
  - names:
    - leader.telekube.local:5000/logrange/forwarder@sha256:1e3dea59ca25d1c771f65e329da1746568826bbbf7e3999fa51ccd80074b3e9d
    - leader.telekube.local:5000/logrange/forwarder:v0.1.43
    sizeBytes: 110511564
  - names:
    - leader.telekube.local:5000/prometheus/prometheus@sha256:eabc34a7067d7f2442aca2d22bc774b961f192f7767a58fed73f99e88ea445b7
    - leader.telekube.local:5000/prometheus/prometheus:v2.7.2
    sizeBytes: 101144312
  - names:
    - leader.telekube.local:5000/monitoring-mta@sha256:d0d7fadd461a0f01ec2144869d38a9dc4149e2aff9c66041e8178074ed346fca
    - leader.telekube.local:5000/monitoring-mta:1.0.0
    sizeBytes: 80245931
  - names:
    - squat/kilo@sha256:5ae1c35fa63eb978ce584cdaa9ad6eff4cf93e6bba732205fdca713b338dba7d
    - squat/kilo:latest
    sizeBytes: 66209142
  - names:
    - leader.telekube.local:5000/gravitational/nethealth-dev@sha256:86615c3d2489aa7a1fc820a4ccb4668cae6b3df8ef7d479555d4caf60ff66007
    - leader.telekube.local:5000/gravitational/nethealth-dev:7.1.0
    sizeBytes: 52671616
  - names:
    - quay.io/jetstack/cert-manager-controller@sha256:bc3f4db7b6db3967e6d4609aa0b2ed7254b1491aa69feb383f47e6c509516384
    - quay.io/jetstack/cert-manager-controller:v0.15.1
    sizeBytes: 52432131
  - names:
    - leader.telekube.local:5000/watcher@sha256:e249dd053943aa43cd10d4b57512489bb850e0d1e023c44d04c668a694f8868d
    - leader.telekube.local:5000/watcher:7.0.1
    sizeBytes: 43508254
  - names:
    - leader.telekube.local:5000/prometheus/alertmanager@sha256:fa782673f873d507906176f09ba83c2a8715bbadbd7f24944d6898fd63f136cf
    - leader.telekube.local:5000/prometheus/alertmanager:v0.16.2
    sizeBytes: 42533012
  - names:
    - leader.telekube.local:5000/coreos/kube-rbac-proxy@sha256:511e4242642545d61f63a1db8537188290cb158625a75a8aedd11d3a402f972c
    - leader.telekube.local:5000/coreos/kube-rbac-proxy:v0.4.1
    sizeBytes: 41317870
  - names:
    - leader.telekube.local:5000/log-adapter@sha256:a6f0482f3c5caa809442a7f51163cfcf28097de4c0738477ea9f7e6affd575ab
    - leader.telekube.local:5000/log-adapter:6.0.4
    sizeBytes: 40059195
  - names:
    - leader.telekube.local:5000/coredns/coredns@sha256:5bec1a83dbee7e2c1b531fbc5dc1b041835c00ec249bcf6b165e1d597dd279fa
    - leader.telekube.local:5000/coredns/coredns:1.2.6
    sizeBytes: 40017418
  - names:
    - quay.io/jetstack/cert-manager-webhook@sha256:8c07a82d3fdad132ec719084ccd90b4b1abc5515d376d70797ba58d201b30091
    - quay.io/jetstack/cert-manager-webhook:v0.15.1
    sizeBytes: 39358529
  - names:
    - leader.telekube.local:5000/gcr.io/google_containers/nettest@sha256:98b0f87c566e8506a0de4234fa0a20f95672d916218cec14c707b1bbdf004b6c
    - gcr.io/google_containers/nettest:1.8
    - leader.telekube.local:5000/gcr.io/google_containers/nettest:1.8
    sizeBytes: 25164808
  - names:
    - leader.telekube.local:5000/coreos/prometheus-config-reloader@sha256:2a64c4fa65749a1c7f73874f7b2aa22192ca6c14fc5b98ba7a86d064bc6b114c
    - leader.telekube.local:5000/coreos/prometheus-config-reloader:v0.29.0
    sizeBytes: 21271393
  - names:
    - leader.telekube.local:5000/prometheus/node-exporter@sha256:42ce76f6c2ade778d066d8d86a7e84c15182dccef96434e1d35b3120541846e0
    - leader.telekube.local:5000/prometheus/node-exporter:v0.17.0
    sizeBytes: 20982005
  - names:
    - leader.telekube.local:5000/gravitational/debian-tall@sha256:ffb404b0d8b12b2ccf8dc19908b3a1ef7a8fff348c2c520b091e2deef1d67cac
    - leader.telekube.local:5000/gravitational/debian-tall:buster
    sizeBytes: 12839230
  - names:
    - leader.telekube.local:5000/gravitational/debian-tall@sha256:231caf443668ddb66abe6453de3e2ad069c5ddf962a69777a22ddac8c74a934d
    - leader.telekube.local:5000/gravitational/debian-tall:stretch
    sizeBytes: 11186931
  - names:
    - leader.telekube.local:5000/gravitational/debian-tall@sha256:b51d1b81c781333bf251493027d8072b5d89d2487f0a293daeb781a6df1e6182
    - leader.telekube.local:5000/gravitational/debian-tall:0.0.1
    sizeBytes: 11023839
  - names:
    - leader.telekube.local:5000/coreos/configmap-reload@sha256:c45ae926edea4aed417054f181768f7248d8c57a64c84369a9e909b622332521
    - leader.telekube.local:5000/coreos/configmap-reload:v0.0.1
    sizeBytes: 4785056
  - names:
    - leader.telekube.local:5000/gcr.io/google_containers/pause@sha256:f04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b
    - gcr.io/google_containers/pause:3.0
    - leader.telekube.local:5000/gcr.io/google_containers/pause:3.0
    sizeBytes: 746888
  nodeInfo:
    architecture: amd64
    bootID: bf06962b-d23d-44ed-925c-5f33f471e15f
    containerRuntimeVersion: docker://18.9.9
    kernelVersion: 4.15.0-108-generic
    kubeProxyVersion: v1.17.6
    kubeletVersion: v1.17.6
    machineID: 7265fe765262551a676151a24c02b7b6
    operatingSystem: linux
    osImage: Debian GNU/Linux 9 (stretch)
    systemUUID: 8A8059B4-490F-49A0-BDB7-6106CA65ABE1
OK yes, clearly Kilo is not successfully discovering the details of the nodes. Is the Kilo container not logging any errors? If not, then can you exec into the pod and collect the output of ip -a?
And can you share all of the configuration flags you are setting on Kilo?
kilo-gravity.yml (I'm playing around with PSP here because Gravity secures its clusters by default, but I don't really know how it works or what I'm doing, so excuse me if it's terribly wrong lol):
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: docker/default
    seccomp.security.alpha.kubernetes.io/defaultProfileName: docker/default
  name: kilo
  namespace: kube-system
spec:
  allowedCapabilities:
  - NET_ADMIN
  - NET_RAW
  - CHOWN
  fsGroup:
    rule: RunAsAny
  hostPorts:
  - max: 65535
    min: 1024
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
  - '*'
  hostNetwork: true
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kilo
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kilo
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - patch
  - watch
  - get
- apiGroups:
  - kilo.squat.ai
  resources:
  - peers
  verbs:
  - list
  - update
  - watch
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - podsecuritypolicies
  verbs:
  - use
  resourceNames:
  - kilo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kilo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kilo
subjects:
- kind: ServiceAccount
  name: kilo
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kilo
  namespace: kube-system
  labels:
    app.kubernetes.io/name: kilo
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kilo
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kilo
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
        seccomp.security.alpha.kubernetes.io/pod: docker/default
    spec:
      serviceAccountName: kilo
      hostNetwork: true
      terminationGracePeriodSeconds: 5
      containers:
      - name: kilo
        image: squat/kilo
        args:
        - --kubeconfig=/etc/kubernetes/kubeconfig
        - --hostname="$(NODE_NAME)"
        - --subnet=100.94.0.0/24
        - --cni=false
        - --compatibility=flannel
        - --local=false
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          privileged: true
        volumeMounts:
        - name: kilo-dir
          mountPath: /var/lib/kilo
        - name: kubesecrets
          mountPath: /var/lib/gravity/secrets
          readOnly: true
        - name: kubeconfig
          mountPath: /etc/kubernetes/kubeconfig
          readOnly: true
        - name: lib-modules
          mountPath: /lib/modules
          readOnly: true
        - name: xtables-lock
          mountPath: /run/xtables.lock
          readOnly: false
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - name: kilo-dir
        hostPath:
          path: /var/lib/kilo
      - name: kubesecrets
        hostPath:
          path: /var/lib/gravity/secrets
      - name: kubeconfig
        hostPath:
          path: /etc/kubernetes/kubectl.kubeconfig
      #- name: kubeconfig
        #hostPath:
          #path: /var/lib/gravity/kubectl.kubeconfig
      - name: lib-modules
        hostPath:
          path: /lib/modules
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
@squat ip -a isn't a valid command. I played around and ip a sounds about right?
/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:50:56:3f:fb:85 brd ff:ff:ff:ff:ff:ff
inet 144.91.83.116/32 scope global eth0
valid_lft forever preferred_lft forever
25: kilo0: <POINTOPOINT,NOARP> mtu 1420 qdisc noop state DOWN group default qlen 1000
link/none
35: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether 72:89:37:e7:59:ad brd ff:ff:ff:ff:ff:ff
inet 100.96.36.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
36: flannel.null: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 06:7c:14:69:b1:24 brd ff:ff:ff:ff:ff:ff
37: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether 82:a3:ed:f6:81:cd brd ff:ff:ff:ff:ff:ff
inet 100.96.36.1/24 brd 100.96.36.255 scope global cni0
valid_lft forever preferred_lft forever
38: veth10f24bc1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 72:5e:a9:de:03:36 brd ff:ff:ff:ff:ff:ff link-netnsid 0
40: vethdcfb339f@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 06:37:da:70:17:f3 brd ff:ff:ff:ff:ff:ff link-netnsid 1
42: veth5b9fd432@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether da:e7:0d:d0:3e:10 brd ff:ff:ff:ff:ff:ff link-netnsid 3
43: vetha1be6e0e@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 6a:91:86:4e:9a:48 brd ff:ff:ff:ff:ff:ff link-netnsid 4
44: veth5806d24a@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 2a:3e:d5:52:fe:5c brd ff:ff:ff:ff:ff:ff link-netnsid 2
46: veth9c08955f@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 72:9a:49:0e:fa:3d brd ff:ff:ff:ff:ff:ff link-netnsid 6
47: veth36fa2de7@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether ca:06:50:bf:8c:f9 brd ff:ff:ff:ff:ff:ff link-netnsid 7
49: veth3ded9a77@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 86:4a:6e:f0:93:5b brd ff:ff:ff:ff:ff:ff link-netnsid 9
50: veth2dc4c0e0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 72:9b:4d:34:d0:24 brd ff:ff:ff:ff:ff:ff link-netnsid 8
51: veth05e4235c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether f2:4f:f1:fd:a7:2d brd ff:ff:ff:ff:ff:ff link-netnsid 5
52: veth35d4241c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 7e:65:64:af:ed:eb brd ff:ff:ff:ff:ff:ff link-netnsid 10
No errors. I see the kilo0 interface as expected, but no config applied.
Logs on the leader pod are as follows (after I added a peer):
{"caller":"main.go:217","msg":"Starting Kilo network mesh '3948f5e97a90a32766b03aaae2a495a3bc1d5263'.","ts":"2020-07-02T15:50:51.586461583Z"}
{"caller":"mesh.go:447","component":"kilo","event":"add","level":"info","peer":{"AllowedIPs":[{"IP":"10.79.0.1","Mask":"/////w=="}],"Endpoint":null,"PersistentKeepalive":25,"PresharedKey":null,"PublicKey":"R2lFazE5WGpycEY3U3d1a25sbEcvbCthdTh5YkcrWXZMdWhCMnFjMkF5WT0=","Name":"athena"},"ts":"2020-07-02T15:50:51.803735326Z"}
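The peer fields in that log line are easier to read once decoded. Go's encoding/json marshals byte slices as base64, so the Mask and PublicKey values are base64 text. Below is a small Python sketch, using only the standard library and the values copied from the log line. Reading PublicKey as a base64-wrapped WireGuard key string is my interpretation of the decoded output, not something stated in this thread.

```python
import base64

# Values copied from the "peer" object in the log line above.
mask_b64 = "/////w=="
public_key_b64 = "R2lFazE5WGpycEY3U3d1a25sbEcvbCthdTh5YkcrWXZMdWhCMnFjMkF5WT0="

# The mask decodes to raw netmask bytes; counting set bits gives the prefix length.
mask = base64.b64decode(mask_b64)
prefix_len = sum(bin(byte).count("1") for byte in mask)
print(f"AllowedIPs: 10.79.0.1/{prefix_len}")

# The public key decodes to ASCII text that is itself base64.
key_text = base64.b64decode(public_key_b64).decode("ascii")
key_raw = base64.b64decode(key_text)
print(f"PublicKey text is {len(key_text)} characters, {len(key_raw)} raw bytes")
```

The mask is four 0xff bytes, so the peer athena is allowed exactly one address, 10.79.0.1/32.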
https://github.com/gravitational/wormhole/blob/master/docs/gravity-wormhole.yaml
I took some inspiration there regarding PSP, since wormhole and kilo kind of do the same thing. Maybe you will spot something in that yaml that I missed?
Quick update: trying a cni-enabled config (kilo-grav-cni.yml) got proper annotations working (threw some x's in there to cover info):
Annotations: kilo.squat.ai/endpoint: [144.91.xx.xxx]:51820
kilo.squat.ai/internal-ip: 144.91.xx.xxx/32
kilo.squat.ai/key: EdysQu0GAeDcmLUwwhsQegPVLjj7clcf0xxxxxxDgTw=
kilo.squat.ai/last-seen: 1593797961
kilo.squat.ai/leader: true
kilo.squat.ai/location: contabo
kilo.squat.ai/wireguard-ip:
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
Kilo pods don't show any error, but wg still doesn't show any config being applied.
Here is the ip a for the host machine:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:50:56:3f:fe:13 brd ff:ff:ff:ff:ff:ff
inet 161.97.70.159/32 scope global eth0
valid_lft forever preferred_lft forever
44: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether fa:af:0d:da:5b:79 brd ff:ff:ff:ff:ff:ff
inet 100.96.41.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
45: flannel.null: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether f2:49:ff:bd:92:a1 brd ff:ff:ff:ff:ff:ff
46: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether 1e:9e:0a:5f:33:9d brd ff:ff:ff:ff:ff:ff
inet 100.96.41.1/24 brd 100.96.41.255 scope global cni0
valid_lft forever preferred_lft forever
47: veth5af95bdd@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 86:15:05:1e:20:56 brd ff:ff:ff:ff:ff:ff link-netnsid 0
48: vethbf67ffe7@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether de:42:67:f4:0e:be brd ff:ff:ff:ff:ff:ff link-netnsid 1
49: vethe5deee26@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether c2:de:c3:1a:8f:4d brd ff:ff:ff:ff:ff:ff link-netnsid 2
50: vethb3575a22@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 16:c7:9d:9e:35:2e brd ff:ff:ff:ff:ff:ff link-netnsid 3
51: veth6e6c8e74@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 56:1a:2a:9f:47:0b brd ff:ff:ff:ff:ff:ff link-netnsid 4
52: veth893a143c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 5a:45:d1:fb:29:2c brd ff:ff:ff:ff:ff:ff link-netnsid 5
53: veth36d8e5bb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 9a:0f:36:8f:d9:0b brd ff:ff:ff:ff:ff:ff link-netnsid 6
54: vethb8c1e2eb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether c2:37:15:a9:b5:70 brd ff:ff:ff:ff:ff:ff link-netnsid 7
55: veth423c7640@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 5a:23:b6:19:1f:cf brd ff:ff:ff:ff:ff:ff link-netnsid 8
56: veth30c16fd1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 2e:72:5b:dd:2c:f1 brd ff:ff:ff:ff:ff:ff link-netnsid 9
57: veth1cba18d9@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 12:dc:25:56:53:15 brd ff:ff:ff:ff:ff:ff link-netnsid 10
58: kilo0: <POINTOPOINT,NOARP> mtu 1420 qdisc noop state DOWN group default qlen 1000
link/none
59: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
And here's the ip a inside the Gravity container, which is accessed by running gravity shell:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:50:56:3f:fe:13 brd ff:ff:ff:ff:ff:ff
inet 161.97.70.159/32 scope global eth0
valid_lft forever preferred_lft forever
44: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether fa:af:0d:da:5b:79 brd ff:ff:ff:ff:ff:ff
inet 100.96.41.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
45: flannel.null: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether f2:49:ff:bd:92:a1 brd ff:ff:ff:ff:ff:ff
46: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether 1e:9e:0a:5f:33:9d brd ff:ff:ff:ff:ff:ff
inet 100.96.41.1/24 brd 100.96.41.255 scope global cni0
valid_lft forever preferred_lft forever
47: veth5af95bdd@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 86:15:05:1e:20:56 brd ff:ff:ff:ff:ff:ff link-netnsid 0
48: vethbf67ffe7@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether de:42:67:f4:0e:be brd ff:ff:ff:ff:ff:ff link-netnsid 1
49: vethe5deee26@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether c2:de:c3:1a:8f:4d brd ff:ff:ff:ff:ff:ff link-netnsid 2
50: vethb3575a22@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 16:c7:9d:9e:35:2e brd ff:ff:ff:ff:ff:ff link-netnsid 3
51: veth6e6c8e74@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 56:1a:2a:9f:47:0b brd ff:ff:ff:ff:ff:ff link-netnsid 4
52: veth893a143c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 5a:45:d1:fb:29:2c brd ff:ff:ff:ff:ff:ff link-netnsid 5
53: veth36d8e5bb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 9a:0f:36:8f:d9:0b brd ff:ff:ff:ff:ff:ff link-netnsid 6
54: vethb8c1e2eb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether c2:37:15:a9:b5:70 brd ff:ff:ff:ff:ff:ff link-netnsid 7
55: veth423c7640@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 5a:23:b6:19:1f:cf brd ff:ff:ff:ff:ff:ff link-netnsid 8
56: veth30c16fd1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 2e:72:5b:dd:2c:f1 brd ff:ff:ff:ff:ff:ff link-netnsid 9
57: veth1cba18d9@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 12:dc:25:56:53:15 brd ff:ff:ff:ff:ff:ff link-netnsid 10
58: kilo0: <POINTOPOINT,NOARP> mtu 1420 qdisc noop state DOWN group default qlen 1000
link/none
59: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
@squat I think I'm close to tracing the problem down: kubectl get nodes -o=jsonpath="{.items[*]['spec.podCIDR']}" doesn't return anything, so the podCIDR isn't being captured by kilo.
Related to #53
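For anyone who wants to see which nodes are affected rather than just an empty jsonpath result, here is a minimal sketch. It assumes the input is the JSON printed by kubectl get nodes -o json; the function name nodes_missing_pod_cidr is made up for illustration and is not part of Kilo or kubectl.

```python
import json
from typing import List


def nodes_missing_pod_cidr(node_list_json: str) -> List[str]:
    """Return the names of nodes whose spec.podCIDR is absent or empty.

    node_list_json is assumed to be the output of `kubectl get nodes -o json`.
    """
    items = json.loads(node_list_json).get("items", [])
    return [
        item["metadata"]["name"]
        for item in items
        # A missing "spec", a missing "podCIDR", and an empty string all count.
        if not item.get("spec", {}).get("podCIDR")
    ]
```

On a cluster like the one above, every node name would be expected to show up in the returned list, since none of them has a podCIDR set.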
I confirmed this by setting log-level to all, and seeing this output:
[kilo-glcnx] {"caller":"mesh.go:373","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-07-03T22:12:47.244435915Z"}
[kilo-5jdsj] {"caller":"mesh.go:382","component":"kilo","event":"update","level":"debug","msg":"received incomplete node","node":{"Endpoint":{"DNS":"","IP":"161.97.70.159","Port":51820},"Key":"YnZYcmZpTFRQbnNLdHpnbC9MUU9LNUorWm5WWnNQZDI3Mk84Q3NhZ0NTST0=","InternalIP":{"IP":"161.97.70.159","Mask":"/////w=="},"LastSeen":1593814367,"Leader":false,"Location":"contabo","Name":"161.97.70.159","PersistentKeepalive":0,"Subnet":null,"WireGuardIP":null},"ts":"2020-07-03T22:12:47.247322094Z"}
[kilo-glcnx] {"caller":"mesh.go:375","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"Endpoint":{"DNS":"","IP":"161.97.70.159","Port":51820},"Key":"YnZYcmZpTFRQbnNLdHpnbC9MUU9LNUorWm5WWnNQZDI3Mk84Q3NhZ0NTST0=","InternalIP":{"IP":"161.97.70.159","Mask":"/////w=="},"LastSeen":1593814367,"Leader":false,"Location":"contabo","Name":"161.97.70.159","PersistentKeepalive":0,"Subnet":null,"WireGuardIP":null},"ts":"2020-07-03T22:12:47.244502618Z"}
[kilo-glcnx] {"caller":"mesh.go:373","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-07-03T22:12:48.06791594Z"}
[kilo-glcnx] {"caller":"mesh.go:382","component":"kilo","event":"update","level":"debug","msg":"received incomplete node","node":{"Endpoint":{"DNS":"","IP":"161.97.70.158","Port":51820},"Key":"cXl4QVBYVXBRNkpkTFRWbXJIUFNNN2U3NWswUFcyaWwxdGZ0cEZNSUZ4cz0=","InternalIP":{"IP":"161.97.70.158","Mask":"/////w=="},"LastSeen":1593814367,"Leader":false,"Location":"contabo","Name":"161.97.70.158","PersistentKeepalive":0,"Subnet":null,"WireGuardIP":null},"ts":"2020-07-03T22:12:48.068020724Z"}
[kilo-5jdsj] {"caller":"mesh.go:471","component":"kilo","level":"debug","msg":"successfully checked in local node in backend","ts":"2020-07-03T22:12:48.074847349Z"}
[kilo-5jdsj] {"caller":"mesh.go:373","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-07-03T22:12:48.086800312Z"}
[kilo-5jdsj] {"caller":"mesh.go:375","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"Endpoint":{"DNS":"","IP":"161.97.70.158","Port":51820},"Key":"cXl4QVBYVXBRNkpkTFRWbXJIUFNNN2U3NWswUFcyaWwxdGZ0cEZNSUZ4cz0=","InternalIP":{"IP":"161.97.70.158","Mask":"/////w=="},"LastSeen":1593814367,"Leader":false,"Location":"contabo","Name":"161.97.70.158","PersistentKeepalive":0,"Subnet":null,"WireGuardIP":null},"ts":"2020-07-03T22:12:48.086884407Z"}
[kilo-5jdsj] {"caller":"mesh.go:373","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-07-03T22:12:51.859849446Z"}
[kilo-5jdsj] {"caller":"mesh.go:382","component":"kilo","event":"update","level":"debug","msg":"received incomplete node","node":{"Endpoint":{"DNS":"","IP":"144.91.83.116","Port":51820},"Key":"RWR5c1F1MEdBZURjbUxVd3doc1FlZ1BWTGpqN2NsY2YwVllKWUM2RGdUdz0=","InternalIP":{"IP":"144.91.83.116","Mask":"/////w=="},"LastSeen":1593814371,"Leader":true,"Location":"contabo","Name":"144.91.83.116","PersistentKeepalive":0,"Subnet":null,"WireGuardIP":null},"ts":"2020-07-03T22:12:51.859927458Z"}
[kilo-glcnx] {"caller":"mesh.go:373","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-07-03T22:12:51.863082336Z"}
[kilo-glcnx] {"caller":"mesh.go:382","component":"kilo","event":"update","level":"debug","msg":"received incomplete node","node":{"Endpoint":{"DNS":"","IP":"144.91.83.116","Port":51820},"Key":"RWR5c1F1MEdBZURjbUxVd3doc1FlZ1BWTGpqN2NsY2YwVllKWUM2RGdUdz0=","InternalIP":{"IP":"144.91.83.116","Mask":"/////w=="},"LastSeen":1593814371,"Leader":true,"Location":"contabo","Name":"144.91.83.116","PersistentKeepalive":0,"Subnet":null,"WireGuardIP":null},"ts":"2020-07-03T22:12:51.863199013Z"}
I'm a bit stuck on how to resolve this though. Afaik according to https://gravitational.com/gravity/docs/installation/ the pod network cidr should be set to 10.244.0.0/16
Great work, these are exactly the logs we needed to see. And this corroborates my suspicion that kilo was not finding ready nodes. This sounds exactly like the same issue we are having with microk8s, where the cluster is being run with the --allocate-node-cidrs flag disabled #53 (comment)
I'm trying to determine how gravity runs flannel but I can't find this in the documentation. In any case, this problem would indicate that flannel is not using k8s as its backend, but rather etcd. This is the problem with the microk8s compatibility and means that we can't rely on the node resource to discover the pod subnet for the node. A workaround for compatibility with this flannel mode would be to have something (either an init container or a flannel-specific compatibility shim) read flannel's config file. Doing this via an init container, i.e. setting the node's pod cidr via a flag on the kg container, would be more generic and could help with other compatibilities in the future.
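To make the "read flannel's config file" idea concrete, here is a sketch of the parsing step only. It assumes the file in question is flannel's subnet.env (by default written to /run/flannel/subnet.env, with keys such as FLANNEL_NETWORK and FLANNEL_SUBNET); where Gravity places it is not confirmed here, and how the result would be handed to Kilo is left open. The function name is hypothetical.

```python
import ipaddress


def pod_cidr_from_subnet_env(text: str) -> str:
    """Extract the node's pod CIDR from the contents of a flannel subnet.env file."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()

    subnet = values.get("FLANNEL_SUBNET")
    if subnet is None:
        raise ValueError("FLANNEL_SUBNET not found in subnet.env contents")

    # FLANNEL_SUBNET holds the bridge address with its prefix length
    # (e.g. 100.96.41.1/24); strict=False normalizes it to the network.
    return str(ipaddress.ip_network(subnet, strict=False))


# Illustrative contents only: the subnet matches the cni0 address in the
# `ip a` output above, and the FLANNEL_NETWORK value is an assumption.
EXAMPLE = """\
FLANNEL_NETWORK=100.96.0.0/16
FLANNEL_SUBNET=100.96.41.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
"""
```

An init container could run something along these lines and pass the resulting CIDR on, so nothing would depend on the node resource carrying spec.podCIDR.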
Maybe this is relevant? https://gravitational.com/gravity/docs/requirements/
Gravity Clusters make high use of Etcd, both for the Kubernetes cluster and for the application's own bookkeeping with respect to e.g. deployed clusters' health and reachability. As a result, it is helpful to have a reliable, performance isolated disk.
To achieve this, by default, Gravity looks for a disk mounted at /var/lib/gravity/planet/etcd. We recommend you mount a dedicated disk there, ext4 formatted with at least 50GiB of free space. A reasonably high performance SSD is preferred. On AWS, we recommend an io1 class EBS volume with at least 1500 provisioned IOPS.
If your Etcd disk is xvdf, you can have the following /etc/fstab entry to make sure it's mounted upon machine startup:
Very bottom. Although it isn't clear if they are using it for flannel. is there a way to check on my cluster?
so bad news. even with the patched annotation, I still can't seem to get kilo to talk nicely with my local peer. kind of at a loss. when connected via wireguard, I am able to ping the internal node ip, the wireguard gateway ip, but none of the pod ips are accessible to me. running a ping on the pod ips seems to drop all packets.
my initial thought was that it's due to container networking weirdness, since Gravity utilizes Planet to containerize Kubernetes, which is a containerd process. Inside the Planet containers docker-images. I tried both promiscuous and veth networking settings with no luck either.
https://github.com/gravitational/workshop/blob/master/gravity_networking.md#flannel This might be helpful. It explains where the flannel config is in the host machine.
yes exactly, this matches pretty much 100% with what I suspected and what we are seeing in microk8s. It looks like indeed we need to go down one of the routes I described in #62 (comment) if we want compatibility with this flannel operational mode
@squat are there any other settings I'm missing aside from the podCIDR tag? Atm just hoping to get this setup manually, but I still seem to be missing something.
Once I added the podCIDR tag to the node spec, the wireguard config applies normally and I see my peer listed in wg. However, I cannot seem to connect to the leader node even though I do see a connection being made. Using ping I can ping the kilo0 gateway (10.4.0.1), but none of the pod ips are reachable from my client.
I confirmed the pod ips are reachable when ssh'd into the node though. I can look into an init container solution once I confirm a manual patch works.
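One thing that can be ruled out quickly on the client side is the peer's AllowedIPs: WireGuard only sends a packet into the tunnel if the destination falls inside one of the AllowedIPs configured for that peer. The sketch below checks that for a given destination. It is only one possible cause, not a diagnosis of this cluster, and the helper name and example values are illustrative.

```python
import ipaddress
from typing import Iterable


def covered_by_allowed_ips(destination: str, allowed_ips: Iterable[str]) -> bool:
    """Return True if destination falls inside any of the given AllowedIPs CIDRs."""
    dest = ipaddress.ip_address(destination)
    return any(
        dest in ipaddress.ip_network(cidr, strict=False)
        for cidr in allowed_ips
    )


# Illustrative values: the kilo0 gateway from this comment and the pod
# subnet seen on cni0 earlier in the thread.
gateway_only = ["10.4.0.1/32"]
gateway_and_pods = ["10.4.0.1/32", "100.96.41.0/24"]
```

If the client config only lists the gateway address, pings to 10.4.0.1 would work while pod IPs would never enter the tunnel, which would look a lot like the symptom described here.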
@eddiewang / @squat Sorry, someone just drew my attention to this issue, I scanned through it relatively quickly.
If my quick read is correct, I think what you're looking for is the networkXXX hooks within the application manifest (Edit: in gravity). We don't really draw attention to these hooks in the docs because we don't really offer support for this configuration and network troubleshooting takes up a lot of support load. There is a networkInstall hook (install time job), and then Update/Rollback hooks for upgrade/rollback operations.
When those hooks are enabled, we disable flannel, and enable the kube-controller ipam to allocate CIDRs to the nodes. Otherwise, it's up to the hook to configure the networking when called (our hook system is based on executing kubernetes jobs).
Gravity builds the hook for wormhole in code, but if it helps the hook code is here: https://github.com/gravitational/gravity/blob/master/lib/app/service/vendor.go#L603-L645
thanks @knisbet for the helpful response. I wasn't able to get it to work and gave up on it, although this insight might make me dig back into this, as for a development cluster, gravity + kilo would be pretty perfect.
want to quickly clarify my understanding of your comment. I need to add specific tags in the gravity build yaml for a networkInstall hook in order to disable the default flannel install... and then apply the kilo yaml..? and the reason we do that is because we want the kilo controller to properly allocate the CIDRs?
or is the reason bc wormhole is always installed on gravity clusters, and we want to disable it here in order to get kilo working? from my understanding I didn't have wormhole enabled while attempting to get kilo and gravity working together.
https://github.com/squat/kilo/blob/master/manifests/kilo-k3s-flannel.yaml is the working config I use for K3S clusters, which have flannel installed by default.
I'd also be interested in contributing a PR for a gravity compatible config into the kilo repo if we're able to get this up and running :)