KubeEdge edgecore supports CNI Cilium?
fujitatomoya opened this issue · 40 comments
What happened and what you expected to happen:
Third-party CNI implementation support is listed on the roadmap (Roadmap 2021-H1), but we are running into several problems after enabling Cilium on edgecore.
How to reproduce it (as minimally and precisely as possible):
After KubeEdge is enabled, install Cilium.
Anything else we need to know?:
The following issues are related to this question.
Let me elaborate on our team's investigation of using `cilium` with `KubeEdge` so far.
Here is our process for setting up the cluster (the `cloudcore` and `edgecore` in use include some workarounds for problems we encountered; I'll show the details as well):
1. Create the cluster on the cloud node with `kubeadm init --node-name pc-windrow`. Copy `/etc/kubernetes/admin.conf` to `~/.kube/config`.
2. Untaint the master node with `kubectl taint node pc-windrow node-role.kubernetes.io/master:NoSchedule-`.
3. Install `cilium` into the cluster with `cilium --encryption wireguard install`.
4. Add `node-role.kubernetes.io/control-plane: ''` and tolerations so that `cilium-operator` is always deployed on the cloud node: `kubectl edit deployment -n kube-system cilium-operator`.
5. Duplicate `cilium`'s `daemonsets`: one stays for nodes using `kubelet`, including the master node; the other is adjusted for nodes using `edgecore`, with `--k8s-api-server=127.0.0.1:10550` added to `cilium-agent`'s arguments. The two `daemonsets` differ only in their names and in that argument; everything else remains the same.
6. Initialize `KubeEdge` on the master node with `sudo keadm init --advertise-address="xxx.xxx.xxx.xxx" --profile version=v1.12.1 --kube-config=/home/windrow/.kube/config`.
7. Add `cloudcore` in `kubectl edit clusterRolebinding cilium` to give it permission to access `cilium`'s resources.
8. Enable `dynamiccontroller` by editing the `configmap` with `kubectl edit cm -n kubeedge cloudcore`, then delete the old pod with `kubectl delete pod -n kubeedge cloudcore-5876c76687-52xr5`.
9. Get the token and join the edge node: `sudo keadm gettoken`, then `sudo keadm join --cloudcore-ipport=xxx.xxx.xxx.xxx:10000 --kubeedge-version=v1.12.1 --cgroupdriver=systemd --token ********`.
10. Edit `/etc/kubeedge/config/edgecore.yaml` to enable `servicebus` and `metamanager`.
11. Replace `/usr/local/bin/edgecore` with an `edgecore` built locally from source with the workarounds, then run `sudo systemctl daemon-reload` and `sudo systemctl restart edgecore`.
12. Create a fake secret: `kubectl create secret generic cilium-clustermesh -n kube-system`. It's a workaround for issue 4817.
Besides issue 4817 and issue 4819 mentioned by Fujita-san, we also found that `edgecore` does not handle `GET /version` and `GET /healthz`, so I hard-coded a `200 OK` for these requests in `edgecore`.
The current state is that the `cilium` applications are running normally. According to their logs, `cilium` on different nodes can communicate with each other and decide the CIDR for each node. But on edge nodes, newly deployed pods are using `169.254.32.0/24` addresses, which belong to the default address pool of `docker`. So I think the CNI plugin is not taking effect.
10. Edit `/etc/kubeedge/config/edgecore.yaml` to enable `servicebus` and `metamanager`.
In this step, `networkPluginName: cni` should also be added to the file, under `edged`.
`networkPluginName: cni` does not take effect in v1.12.1. v1.12.2 fixed it; you can try it.
I'm based on the head of `origin/release-1.12`: `commit 3de324ec27e748a70a01e8520aec559516a5854b (HEAD -> release-1.12, tag: v1.12.2, origin/release-1.12)`. So the fix should already be included? Let me do some verification.
@Shelley-BaoYue Thanks for your advice!
By the way, is there a way to prevent `keadm` from pulling and copying a new `edgecore` to `/usr/local/bin` during `keadm join`?
There is still a problem related to `namespace`. As you can see in the log, the resource `cilium` wants to fetch is a CRD in the "default" namespace, but `message_dispatcher.go` thinks there is a namespace named "null".
E0713 13:06:54.931545 1 message_dispatcher.go:308] "MessageRoute.Resource" err="namespaces \"null\" not found" resource="node/pc-k8s-192-168-0-121/null/ciliumidentity/24958"
E0713 13:06:54.931612 1 message_dispatcher.go:309] "Failed to create objectSync" err="namespaces \"null\" not found" objectSyncName="pc-k8s-192-168-0-121.15fe34e6-b5d5-4ec3-bc1a-47acffdd146b" resourceNamespace="null" resourceName="24958"
E0713 13:06:54.989269 1 message_dispatcher.go:308] "MessageRoute.Resource" err="namespaces \"null\" not found" resource="node/pc-k8s-192-168-0-121/null/ciliumidentity/45429"
E0713 13:06:54.989361 1 message_dispatcher.go:309] "Failed to create objectSync" err="namespaces \"null\" not found" objectSyncName="pc-k8s-192-168-0-121.1b1c443b-ac3b-4106-9854-bc758675bf24" resourceNamespace="null" resourceName="45429"
E0713 13:06:55.046585 1 message_dispatcher.go:308] "MessageRoute.Resource" err="namespaces \"null\" not found" resource="node/pc-k8s-192-168-0-121/null/ciliumidentity/51110"
E0713 13:06:55.046644 1 message_dispatcher.go:309] "Failed to create objectSync" err="namespaces \"null\" not found" objectSyncName="pc-k8s-192-168-0-121.88fa747d-d459-4698-9572-78ee4d33a380" resourceNamespace="null" resourceName="51110"
My workaround for the above problem is (please ignore the extra logging):
diff --git a/cloud/pkg/cloudhub/dispatcher/message_dispatcher.go b/cloud/pkg/cloudhub/dispatcher/message_dispatcher.go
index f692563eb..297bb1f9a 100644
--- a/cloud/pkg/cloudhub/dispatcher/message_dispatcher.go
+++ b/cloud/pkg/cloudhub/dispatcher/message_dispatcher.go
@@ -265,6 +265,16 @@ func (md *messageDispatcher) enqueueAckMessage(nodeID string, msg *beehivemodel.
return
}
+ resourceType, _ := messagelayer.GetResourceType(*msg)
+ if resourceType == beehivemodel.ResourceTypeNamespace {
+ resourceNamespace = resourceName
+ }
+ klog.Errorf("resourceNamespace original: %s", resourceNamespace)
+ if resourceNamespace == "null" {
+ resourceNamespace = "default"
+ }
+ klog.Errorf("resourceNamespace: %s", resourceNamespace)
+
objectSyncName := synccontroller.BuildObjectSyncName(nodeID, resourceUID)
objectSync, err := md.objectSyncLister.ObjectSyncs(resourceNamespace).Get(objectSyncName)
@@ -295,6 +305,7 @@ func (md *messageDispatcher) enqueueAckMessage(nodeID string, msg *beehivemodel.
ObjectSyncs(resourceNamespace).
Create(context.Background(), objectSync, metav1.CreateOptions{})
if err != nil {
+ klog.ErrorS(err, "MessageRoute.Resource", "resource", msg.GetResource())
klog.ErrorS(err, "Failed to create objectSync",
"objectSyncName", objectSyncName,
"resourceNamespace", resourceNamespace,
diff --git a/cloud/pkg/cloudhub/session/node_session.go b/cloud/pkg/cloudhub/session/node_session.go
index 045410185..767c8b884 100644
--- a/cloud/pkg/cloudhub/session/node_session.go
+++ b/cloud/pkg/cloudhub/session/node_session.go
@@ -386,6 +386,15 @@ func (ns *NodeSession) saveSuccessPoint(msg *beehivemodel.Message) {
return
}
+ if resourceType == beehivemodel.ResourceTypeNamespace {
+ resourceNamespace = resourceName
+ }
+ klog.Errorf("resourceNamespace original: %s", resourceNamespace)
+ if resourceNamespace == "null" {
+ resourceNamespace = "default"
+ }
+ klog.Errorf("resourceNamespace: %s", resourceNamespace)
+
objectSyncName := synccontroller.BuildObjectSyncName(ns.nodeID, resourceUID)
if msg.GetOperation() == beehivemodel.DeleteOperation {
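The namespace fallback in the diff above can be tried in isolation. The following is a minimal sketch, assuming the resource string has the form `node/<nodeID>/<namespace>/<resourceType>/<resourceName>` as it appears in the log lines; `parseResource` and `objectSyncNamespace` are hypothetical helpers, not KubeEdge functions, and the literal `"namespace"` is assumed to correspond to `beehivemodel.ResourceTypeNamespace`.

```go
package main

import (
	"fmt"
	"strings"
)

// parsedResource holds the pieces of a resource string such as
// node/pc-k8s-192-168-0-121/null/ciliumidentity/24958.
type parsedResource struct {
	NodeID    string
	Namespace string
	Type      string
	Name      string
}

// parseResource splits the five-segment form seen in the cloudcore logs.
// Other forms (for example node/<nodeID> alone) are reported as errors.
func parseResource(resource string) (parsedResource, error) {
	parts := strings.Split(resource, "/")
	if len(parts) != 5 || parts[0] != "node" {
		return parsedResource{}, fmt.Errorf("unexpected resource format: %q", resource)
	}
	return parsedResource{
		NodeID:    parts[1],
		Namespace: parts[2],
		Type:      parts[3],
		Name:      parts[4],
	}, nil
}

// objectSyncNamespace mirrors the two checks in the workaround: a namespace
// resource uses its own name, and a literal "null" falls back to "default".
func objectSyncNamespace(p parsedResource) string {
	ns := p.Namespace
	if p.Type == "namespace" { // assumed value of the namespace resource type
		ns = p.Name
	}
	if ns == "null" {
		ns = "default"
	}
	return ns
}

func main() {
	p, err := parseResource("node/pc-k8s-192-168-0-121/null/ciliumidentity/24958")
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Type, p.Name, objectSyncNamespace(p))
}
```

Mapping "null" to "default" only chooses where the objectSync record is stored; it does not change the scope of the resource itself.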
However, with this workaround, the connection between cloudcore and edgecore becomes unstable.
Cloud side log:
W0713 14:15:06.532295 1 upstream.go:217] parse message: 68988ccb-ccb8-4461-82b2-6684a901e7d1 resource type with error, message resource: node/pc-k8s-192-168-0-121, err: resource type not found
I0713 14:15:06.532335 1 message_handler.go:122] edge node pc-k8s-192-168-0-121 for project e632aba927ea4ac2b575ec1603d56f10 connected
I0713 14:15:06.532518 1 node_session.go:136] Start session for edge node pc-k8s-192-168-0-121
I0713 14:15:06.534728 1 application.go:446] [metaserver/ApplicationCenter] get a Application (NodeName=pc-k8s-192-168-0-121;Key=/networking.k8s.io/v1/networkpolicies/null/null;Verb=watch;Status=Completed;Reason=failed to access cloud Application center: timeout to get response for message 8450c197-d20b-4bc4-b638-8071e85612c5)
I0713 14:15:06.534867 1 application.go:455] [metaserver/applicationCenter]successfully to process Application((NodeName=pc-k8s-192-168-0-121;Key=/networking.k8s.io/v1/networkpolicies/null/null;Verb=watch;Status=Approved;Reason=failed to access cloud Application center: timeout to get response for message 8450c197-d20b-4bc4-b638-8071e85612c5))
I0713 14:15:06.535380 1 upstream.go:89] Dispatch message: 2a585f02-cbb8-4d82-ba40-bd6c163294bf
I0713 14:15:06.535424 1 upstream.go:96] Message: 2a585f02-cbb8-4d82-ba40-bd6c163294bf, resource type is: membership/detail
I0713 14:15:06.535469 1 application.go:446] [metaserver/ApplicationCenter] get a Application (NodeName=pc-k8s-192-168-0-121;Key=/cilium.io/v2/ciliumnodes/null/pc-k8s-192-168-0-121;Verb=update;Status=Completed;Reason=failed to access cloud Application center: timeout to get response for message fcac3af0-ff3a-4498-ae48-c998882e8bf6)
E0713 14:15:06.561567 1 application.go:451] [metaserver/applicationCenter]failed to process Application((NodeName=pc-k8s-192-168-0-121;Key=/cilium.io/v2/ciliumnodes/null/pc-k8s-192-168-0-121;Verb=update;Status=Rejected;Reason=failed to access cloud Application center: timeout to get response for message fcac3af0-ff3a-4498-ae48-c998882e8bf6)), Operation cannot be fulfilled on ciliumnodes.cilium.io "pc-k8s-192-168-0-121": the object has been modified; please apply your changes to the latest version and try again
I0713 14:15:06.561795 1 application.go:446] [metaserver/ApplicationCenter] get a Application (NodeName=pc-k8s-192-168-0-121;Key=/core/v1/namespaces/null/null;Verb=watch;Status=Completed;Reason=failed to access cloud Application center: timeout to get response for message efae166e-d2bd-4986-965a-59776584cf40)
I0713 14:15:06.562148 1 application.go:455] [metaserver/applicationCenter]successfully to process Application((NodeName=pc-k8s-192-168-0-121;Key=/core/v1/namespaces/null/null;Verb=watch;Status=Approved;Reason=failed to access cloud Application center: timeout to get response for message efae166e-d2bd-4986-965a-59776584cf40))
W0713 14:15:06.864566 1 node_session.go:284] node pc-k8s-192-168-0-121 is deleted, message for node will be cleaned up
I0713 14:15:06.864628 1 message_handler.go:139] edge node pc-k8s-192-168-0-121 for project e632aba927ea4ac2b575ec1603d56f10 disConnected
E0713 14:15:06.864670 1 ws.go:107] failed to read message, error: read tcp xxx.xxx.xxx.xxx:10000->xxx.xxx.yyy.yyy:33914: use of closed network connection
W0713 14:15:06.864685 1 upstream.go:217] parse message: 0bde0350-4994-436d-9a2a-ddc89a7e428a resource type with error, message resource: node/pc-k8s-192-168-0-121, err: resource type not found
E0713 14:15:06.864696 1 message_handler.go:166] projectID e632aba927ea4ac2b575ec1603d56f10 node pc-k8s-192-168-0-121 read message err
E0713 14:15:06.864704 1 message_handler.go:170] session not found for node pc-k8s-192-168-0-121
W0713 14:15:06.866193 1 message_dispatcher.go:416] message pool for edge node pc-k8s-192-168-0-121 not found and created now
Edge side log:
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.186916 49983 websocket.go:51] Websocket start to connect Access
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.196177 49983 ws.go:46] dial wss://xxx.xxx.xxx.xxx:10000/e632aba927ea4ac2b575ec1603d56f10/pc-k8s-192-168-0-121/events successfully
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.196258 49983 websocket.go:93] Websocket connect to cloud access successful
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.196370 49983 process.go:461] node connection event occur: cloud_connected
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.196410 49983 process.go:461] node connection event occur: cloud_connected
7月 13 14:09:23 edge121 edgecore[49983]: W0713 14:09:23.196425 49983 eventbus.go:168] Action not found
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.196474 49983 process.go:299] DeviceTwin receive msg
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.196531 49983 process.go:68] Send msg to the CommModule module in twin
7月 13 14:09:23 edge121 edgecore[49983]: W0713 14:09:23.298635 49983 context_channel.go:164] Get bad anonName:d12873a3-efe7-46fa-a3bb-2703afb9cb58 when sendresp message, do nothing
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.363680 49983 trace.go:205] Trace[1940341198]: "Update" url:/apis/cilium.io/v2/ciliumnodes/pc-k8s-192-168-0-121,user-agent:cilium-agent/1.12.0 9447cd1 2022-07-19T12:22:00+02:00 go version go1.18.4 linux/amd64,audit-id:,client:127.0.0.1,accept:application/json, */*,protocol:HTTP/1.1 (13-Jul-2023 14:09:19.505) (total time: 3858ms):
7月 13 14:09:23 edge121 edgecore[49983]: Trace[1940341198]: [3.858596122s] [3.858596122s] END
7月 13 14:09:23 edge121 edgecore[49983]: E0713 14:09:23.556967 49983 ws.go:107] failed to read message, error: websocket: close 1006 (abnormal closure): unexpected EOF
7月 13 14:09:23 edge121 edgecore[49983]: E0713 14:09:23.598141 49983 process.go:161] failed to send message to cloud: failed to send message, error: use of closed network connection
7月 13 14:09:23 edge121 edgecore[49983]: E0713 14:09:23.598264 49983 process.go:112] websocket read error: the fifo is broken
7月 13 14:09:23 edge121 edgecore[49983]: W0713 14:09:23.598293 49983 edgehub.go:126] connection is broken, will reconnect after 30s
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.598322 49983 process.go:299] DeviceTwin receive msg
7月 13 14:09:23 edge121 edgecore[49983]: W0713 14:09:23.598350 49983 eventbus.go:168] Action not found
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.598369 49983 process.go:68] Send msg to the CommModule module in twin
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.598514 49983 process.go:461] node connection event occur: cloud_disconnected
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.598559 49983 process.go:461] node connection event occur: cloud_disconnected
7月 13 14:09:38 edge121 edgecore[49983]: E0713 14:09:38.197144 49983 process.go:183] websocket write error: failed to send message, error: use of closed network connection
7月 13 14:09:43 edge121 edgecore[49983]: E0713 14:09:43.367243 49983 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"failed to access cloud Application center: timeout to get response for message 726de9ad-3070-47f7-b0d5-ca14a3567952"}: failed to access cloud Application center: timeout to get response for message 726de9ad-3070-47f7-b0d5-ca14a3567952
7月 13 14:09:43 edge121 edgecore[49983]: I0713 14:09:43.367458 49983 trace.go:205] Trace[2121639627]: "Update" url:/apis/cilium.io/v2/ciliumnodes/pc-k8s-192-168-0-121,user-agent:cilium-agent/1.12.0 9447cd1 2022-07-19T12:22:00+02:00 go version go1.18.4 linux/amd64,audit-id:,client:127.0.0.1,accept:application/json, */*,protocol:HTTP/1.1 (13-Jul-2023 14:09:33.365) (total time: 10001ms):
7月 13 14:09:43 edge121 edgecore[49983]: Trace[2121639627]: [10.001876031s] [10.001876031s] END
Also, please check this comment in issue 4819: #4819 (comment).
Although that ticket is closed, the MR related to it does not cover the problem mentioned in the ticket. We want to reopen that ticket. Please review the additional patch shown in that comment.
Maybe the instability of the connection between cloudcore and `edgecore` is not caused by the workaround, but by the restart of `edgecore`.
W0714 11:15:42.218567 1 message_dispatcher.go:417] message pool for edge node pc-k8s-192-168-0-121 not found and created now
E0714 11:15:47.212098 1 node_session.go:194] syncAckMessage err: use of closed network connection
W0714 11:16:12.244028 1 upstream.go:217] parse message: 2a148267-1355-46c5-b482-2d699c3c8b3a resource type with error, message resource: node/pc-k8s-192-168-0-121, err: resource type not found
I0714 11:16:12.244105 1 message_handler.go:122] edge node pc-k8s-192-168-0-121 for project e632aba927ea4ac2b575ec1603d56f10 connected
I0714 11:16:12.244234 1 node_session.go:136] Start session for edge node pc-k8s-192-168-0-121
I0714 11:16:12.246083 1 upstream.go:89] Dispatch message: 43dc2709-fdaa-4953-9154-c97ee8edae3c
I0714 11:16:12.246125 1 upstream.go:96] Message: 43dc2709-fdaa-4953-9154-c97ee8edae3c, resource type is: membership/detail
W0714 11:16:13.373796 1 node_session.go:284] node pc-k8s-192-168-0-121 is deleted, message for node will be cleaned up
E0714 11:16:13.373865 1 node_session.go:194] syncAckMessage err: AckMessageQueue for node pc-k8s-192-168-0-121 has shutdown
E0714 11:16:13.373875 1 ws.go:107] failed to read message, error: read tcp 43.82.125.156:10000->43.82.111.25:38884: use of closed network connection
I0714 11:16:13.373883 1 message_handler.go:139] edge node pc-k8s-192-168-0-121 for project e632aba927ea4ac2b575ec1603d56f10 disConnected
E0714 11:16:13.373893 1 message_handler.go:166] projectID e632aba927ea4ac2b575ec1603d56f10 node pc-k8s-192-168-0-121 read message err
E0714 11:16:13.373900 1 message_handler.go:170] session not found for node pc-k8s-192-168-0-121
W0714 11:16:13.373929 1 upstream.go:217] parse message: 2e41c8de-d5c7-40db-816c-95e50f5631ed resource type with error, message resource: node/pc-k8s-192-168-0-121, err: resource type not found
W0714 11:16:13.442387 1 message_dispatcher.go:417] message pool for edge node pc-k8s-192-168-0-121 not found and created now
I0714 11:16:47.065018 1 message_handler.go:122] edge node pc-k8s-192-168-0-121 for project e632aba927ea4ac2b575ec1603d56f10 connected
W0714 11:16:47.065047 1 upstream.go:217] parse message: 6c7356f1-8985-4294-a438-669645547d28 resource type with error, message resource: node/pc-k8s-192-168-0-121, err: resource type not found
I0714 11:16:47.065166 1 node_session.go:136] Start session for edge node pc-k8s-192-168-0-121
I0714 11:16:47.067045 1 upstream.go:89] Dispatch message: e0301055-a4dc-4cd5-b8fe-48ba4a8453f3
I0714 11:16:47.067096 1 upstream.go:96] Message: e0301055-a4dc-4cd5-b8fe-48ba4a8453f3, resource type is: membership/detail
I0714 11:16:47.067223 1 application.go:446] [metaserver/ApplicationCenter] get a Application (NodeName=pc-k8s-192-168-0-121;Key=/cilium.io/v2/ciliumendpoints/null/null;Verb=watch;Status=Completed;Reason=failed to access cloud Application center: timeout to get response for message 2321f6e0-3cba-45cf-9cd6-c0ac084fa339)
I0714 11:16:47.068370 1 application.go:455] [metaserver/applicationCenter]successfully to process Application((NodeName=pc-k8s-192-168-0-121;Key=/cilium.io/v2/ciliumendpoints/null/null;Verb=watch;Status=Approved;Reason=failed to access cloud Application center: timeout to get response for message 2321f6e0-3cba-45cf-9cd6-c0ac084fa339))
W0714 11:16:50.038050 1 node_session.go:284] node pc-k8s-192-168-0-121 is deleted, message for node will be cleaned up
I0714 11:16:50.038080 1 message_handler.go:139] edge node pc-k8s-192-168-0-121 for project e632aba927ea4ac2b575ec1603d56f10 disConnected
E0714 11:16:50.038120 1 ws.go:107] failed to read message, error: read tcp 43.82.125.156:10000->43.82.111.25:38888: use of closed network connection
W0714 11:16:50.038124 1 upstream.go:217] parse message: d09bb98e-4213-41ba-89f5-e2c5f0d14229 resource type with error, message resource: node/pc-k8s-192-168-0-121, err: resource type not found
E0714 11:16:50.038135 1 message_handler.go:166] projectID e632aba927ea4ac2b575ec1603d56f10 node pc-k8s-192-168-0-121 read message err
E0714 11:16:50.038146 1 message_handler.go:170] session not found for node pc-k8s-192-168-0-121
E0714 11:16:55.037746 1 node_session.go:194] syncAckMessage err: use of closed network connection
W0714 11:16:55.180870 1 message_dispatcher.go:417] message pool for edge node pc-k8s-192-168-0-121 not found and created now
It looks like `cloudhub` thinks a session is already closed, so it closes the websocket connection to that node (in `SendAckMessage` in `node_session.go`). Although the session is created again because of a new message from `edgecore`, the EOF error caused by the shutdown makes `edgecore` abandon that session. Then `cloudhub` thinks the session is dead again and wants to terminate it (`SendNoAckMessage`). This logic loops again and again...
Let me update the current situation (2023.7.27):
- With a number of workarounds and a strictly ordered sequence of steps, we can make an edge node managed by `KubeEdge` work with `Cilium`. There are no obvious problems creating pods or pinging IP addresses between pods on different nodes.
- There are still some error messages in `edgecore`'s journal about failing to get `ciliumidentity`, but no consequences have been seen so far.
- In-cluster DNS on an edge node needs EdgeMesh. However, in my setup it works only between two edgecore-managed nodes or locally on an edgecore-managed node, not between an edgecore-managed node and a kubelet-managed node.
Current steps for setting up the cluster:
1. Prepare `cloudcore`'s image and `edgecore`'s binary.
   - Repository: https://github.com/kubeedge/kubeedge.git
   - Branch: origin/release-1.12
   - Baseline Commit: 3de324e (v1.12.2)
   - Patch: 20230727_kubeedge_cilium.txt
2. Create the cluster on the cloud node.

       sudo kubeadm init --node-name pc-windrow
       mkdir ~/.kube
       sudo cp /etc/kubernetes/admin.conf ~/.kube/config
       sudo chown $(id -u):$(id -g) ~/.kube/config
       sudo mkdir /root/.kube
       sudo cp /etc/kubernetes/admin.conf /root/.kube/config

3. Untaint the cloud node.

       kubectl taint node pc-windrow node-role.kubernetes.io/master:NoSchedule-

4. Install `Cilium`.

       cilium --encryption wireguard install

5. Make `cilium-operator` always be deployed on the cloud node by adding `node-role.kubernetes.io/control-plane: ''` and tolerations. (Attachment: cilium-operator.txt)

       kubectl apply -f cilium-operator.yaml

6. Duplicate `cilium`'s `daemonsets`: one stays for nodes using `kubelet`, including the master node; the other is adjusted for nodes using `edgecore`, with `--k8s-api-server=127.0.0.1:10550` added to `cilium-agent`'s arguments. The two `daemonsets` differ only in their names and in that argument; everything else remains the same. (Attachments: cilium-kubelet.txt, cilium-kubeedge.txt)

       kubectl apply -f cilium-kubelet.yaml
       kubectl apply -f cilium-kubeedge.yaml

7. Initialize `KubeEdge` on the cloud node.

       sudo keadm init --advertise-address="xxx.xxx.xxx.xxx" --profile version=v1.12.2 --kube-config=/home/windrow/.kube/config

8. Give `cloudcore` access to `cilium`'s resources by editing the `clusterRolebinding`. (Attachments: cilium-clusterrole.txt, cilium-clusterrolebinding.txt)

       kubectl apply -f cilium-clusterrole.yaml
       kubectl apply -f cilium-clusterrolebinding.yaml

9. Enable `dynamiccontroller` by editing the `configmap`. (Attachment: kubeedge-configmap-cloudcore.txt)

       kubectl apply -f kubeedge-configmap-cloudcore.yaml
       kubectl delete pod -n kubeedge -l kubeedge=cloudcore

10. Get the token after `cloudcore` has restarted.

       sudo keadm gettoken

11. Join the edge node. (Remember to stop and remove `mosquitto`'s container before joining; it occupies port 1883.)

       sudo keadm join --cloudcore=xxx.xxx.xxx.xxx:10000 --kubeedge-version=v1.12.2 --cgroupdriver=systemd --edgenode-name pc-k8s-192-168-0-121 --token e99dced4689534f29b1502754c7fa63c90834650285aa56e0079d7b794b1676b.eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2OTA1MTk4Nzh9.hMA-6WCnOVmEXwGgGCsE7Dc8sufUD_CJ6o-pIrdPl8c

12. Edit `/etc/kubeedge/config/edgecore.yaml`. (Attachment: edgecore.txt)

       diff --git a/edgecore.yaml b/edgecore.yaml
       index 82e5ece..c011e78 100644
       --- a/edgecore.yaml
       +++ b/edgecore.yaml
       @@ -55,6 +55,7 @@ modules:
        masterServiceNamespace: default
        maxPerPodContainerCount: 1
        minimumGCAge: 0s
       +networkPluginName: cni
        networkPluginMTU: 1500
        nonMasqueradeCidr: 10.0.0.0/8
        podSandboxImage: kubeedge/pause:3.1
       @@ -69,6 +70,9 @@ modules:   <-- This block is for EdgeMesh, we can neglect it if not using EdgeMesh.
        address: 127.0.0.1
        cgroupDriver: systemd
        cgroupsPerQOS: true
       +clusterDNS:
       +- 169.254.96.16
       +clusterDomain: cluster.local
        configMapAndSecretChangeDetectionStrategy: Get
        containerLogMaxFiles: 5
        containerLogMaxSize: 10Mi
       @@ -157,14 +161,14 @@ modules:
        contextSendModule: websocket
        enable: true
        metaServer:
       -enable: false
       +enable: true
        server: 127.0.0.1:10550
        tlsCaFile: /etc/kubeedge/ca/rootCA.crt
        tlsCertFile: /etc/kubeedge/certs/server.crt
        tlsPrivateKeyFile: /etc/kubeedge/certs/server.key
        remoteQueryTimeout: 60
        serviceBus:
       -enable: false
       +enable: true
        port: 9060
        server: 127.0.0.1
        timeout: 60

13. Replace the `edgecore` binary. (The binary at `/usr/local/bin/edgecore` is re-downloaded each time `keadm join ...` is used.)

       sudo systemctl stop edgecore
       sudo cp edgecore /usr/local/bin/edgecore
       sudo systemctl daemon-reload
       sudo systemctl start edgecore

14. Create a fake secret. (A workaround for issue 4817.)

       kubectl create secret generic cilium-clustermesh -n kube-system

15. Deploy EdgeMesh. (To check further, not required.)
Current code changes in `kubeedge`:
- Error when getting `namespace` through `metaserver`.
- Error when getting `version` and `healthz` through `metaserver`.
- Error when getting `node` through `metaserver`.
The `version` API through metaserver will be supported in the next version (v1.15) @ZhengXinwei-F, and you're welcome to contribute your code changes.
@Shelley-BaoYue Thanks for the information! These code changes are mostly ugly workarounds. So far I'm only verifying feasibility.
Will getting a CRD, such as `ciliumendpoint`, through `metaserver` be supported some day? Currently, it is neglected in the `switch` statement of `syncPod` in `edged.go`.
We will consider it as a feature, though it may take a while to achieve. You are welcome to join the discussion at the community meeting and describe your requirements in detail.
I agree with @Shelley-BaoYue, we will consider it as a feature.
And here is an interim solution for this issue in the current version.
In terms of code implementation, the present version of metaServer already supports getting CRs.
But I discovered that the request fails because cloudcore lacks permissions, as in the following example.
P.S. My Linux kernel version does not support Cilium, so we will use Calico as an example here.
curl 127.0.0.1:10550/apis/crd.projectcalico.org/v1/ipamblocks
{
"apiVersion": "crd.projectcalico.org/v1",
"items": [],
"kind": "IpamblockList"
}
cloudcore logs:
E0731 17:16:38.321093 1 application.go:61] [metaserver/applicationCenter]failed to process Application((NodeName=node1;Key=/crd.projectcalico.org/v1/ipamblocks/null/null;Verb=list;Status=Rejected;Reason=get current list error: ipamblocks.crd.projectcalico.org is forbidden: User "system:serviceaccount:kubeedge:cloudcore" cannot list resource "ipamblocks" in API group "crd.projectcalico.org" at the cluster scope)), get current list error: ipamblocks.crd.projectcalico.org is forbidden: User "system:serviceaccount:kubeedge:cloudcore" cannot list resource "ipamblocks" in API group "crd.projectcalico.org" at the cluster scope
To grant cloudcore permission on the CRs, we can update the ClusterRole called cloudcore as follows:
kubectl edit clusterrole cloudcore
(ps, you can also create a new clusterrole and clusterrolebinding)
- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipreservations
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  - caliconodestatuses
  verbs:
  - get
  - list
  - watch
then
curl 127.0.0.1:10550/apis/crd.projectcalico.org/v1/ipamblocks
{
"apiVersion": "crd.projectcalico.org/v1",
"items": [
{
"apiVersion": "crd.projectcalico.org/v1",
"kind": "IPAMBlock",
"metadata": {
"annotations": {
"projectcalico.org/metadata": "{\"creationTimestamp\":null}"
},
"creationTimestamp": "2023-07-29T02:50:32Z",
"generation": 13,
"managedFields": [
{
"apiVersion": "crd.projectcalico.org/v1",
"fieldsType": "FieldsV1",
"fieldsV1": {
"f:metadata": {
"f:annotations": {
.......
kubectl edit clusterrole cloudcore
(ps, you can also create a new clusterrole and clusterrolebinding)
Yes, I also edited the `clusterrole` and `clusterrolebinding` in step 8 of #4844 (comment). Let me double-check.
https://kubernetes.io/docs/reference/access-authn-authz/rbac/#aggregated-clusterroles
Maybe the "Aggregated ClusterRoles" feature of K8s can help you handle this issue. We'll also take this approach into account when developing for future requirements.
For example:
cloudcore clusterrole:
aggregationRule:
  clusterRoleSelectors:
  - matchLabels:
      rbac.authorization.k8s.io/aggregate-to-cloudcore: "true"
cilium, calico, and others:
metadata:
  labels:
    rbac.authorization.k8s.io/aggregate-to-cloudcore: "true"
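To make the aggregation idea concrete, here is a small conceptual model in Go of how label-selected rules end up in the aggregated role. It is only a sketch with simplified, hypothetical types (`rule`, `clusterRole`, `aggregate`); in a real cluster the Kubernetes control plane performs this step, and the RBAC API types are richer than shown. The `cilium.io` / `ciliumnodes` rule is taken from the logs in this thread, while the role names are made up.

```go
package main

import "fmt"

// rule is a simplified stand-in for an RBAC policy rule.
type rule struct {
	APIGroup string
	Resource string
	Verbs    []string
}

// clusterRole is a simplified stand-in for a ClusterRole. A non-empty
// AggregateLabels marks it as an aggregated role.
type clusterRole struct {
	Name            string
	Labels          map[string]string
	Rules           []rule
	AggregateLabels map[string]string
}

// matches reports whether labels contain every key/value pair of selector.
// An empty selector is treated as matching nothing in this sketch.
func matches(labels, selector map[string]string) bool {
	if len(selector) == 0 {
		return false
	}
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// aggregate collects the rules of every other role whose labels match the
// target's selector.
func aggregate(target clusterRole, all []clusterRole) []rule {
	var out []rule
	for _, cr := range all {
		if cr.Name == target.Name {
			continue
		}
		if matches(cr.Labels, target.AggregateLabels) {
			out = append(out, cr.Rules...)
		}
	}
	return out
}

func main() {
	const label = "rbac.authorization.k8s.io/aggregate-to-cloudcore"
	cloudcore := clusterRole{
		Name:            "cloudcore",
		AggregateLabels: map[string]string{label: "true"},
	}
	roles := []clusterRole{
		cloudcore,
		{
			Name:   "cloudcore-cilium", // hypothetical role name
			Labels: map[string]string{label: "true"},
			Rules:  []rule{{APIGroup: "cilium.io", Resource: "ciliumnodes", Verbs: []string{"get", "list", "watch"}}},
		},
		{
			Name:   "cloudcore-calico", // hypothetical role name
			Labels: map[string]string{label: "true"},
			Rules:  []rule{{APIGroup: "crd.projectcalico.org", Resource: "ipamblocks", Verbs: []string{"get", "list", "watch"}}},
		},
		{
			Name:  "unrelated",
			Rules: []rule{{APIGroup: "", Resource: "pods", Verbs: []string{"get"}}},
		},
	}
	for _, r := range aggregate(cloudcore, roles) {
		fmt.Println(r.APIGroup, r.Resource, r.Verbs)
	}
}
```

With this arrangement each CNI can ship its own labeled ClusterRole, and the cloudcore role does not have to be edited by hand.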
Hi @fisherxu, excuse me, is there somewhere I can find documentation for submodules such as `dynamiccontroller`?
#4904 this patch should also relate to this issue cc @ZhengXinwei-F
@Windrow14 Could you please open a PR to submit your patch?
@fisherxu Let me have a check with release 1.14 first, it seems some problems are already fixed there. Then I'll create a PR for remaining things.
#4589 I encountered the problem described in this ticket... @fisherxu @Shelley-BaoYue
Error when get version and healthz through metaserver.
@Windrow14 we surely want to add the support for `/healthz` aligned with #4904?
You can add the `/healthz` pass-through capability after #4904 has been merged.
@fujitatomoya PR #4904 has been merged. Based on this PR, you can support more URIs such as '/healthz'.
P.S. Would you be interested in creating some E2E tests for metaServer's non-resources URIs? : )
@ZhengXinwei-F thanks for the contribution.
CC: @Windrow14
Based on this PR, you can support more URIs such as '/healthz'.
We would add them on a need-to-add basis.
#4589 I encountered the problem described in this ticket... @fisherxu @Shelley-BaoYue
// Sometimes, we need guess kind according to resource:
// 1. In most cases, is like pods to Pod,
// 2. In some unusual cases, requires special treatment like endpoints to Endpoints
func UnsafeResourceToKind(r string) string {
	if len(r) == 0 {
		return r
	}
	unusualResourceToKind := map[string]string{
		"endpoints":                    "Endpoints",
		"endpointslices":               "EndpointSlice",
		"nodes":                        "Node",
		"namespaces":                   "Namespace",
		"services":                     "Service",
		"podstatus":                    "PodStatus",
		"nodestatus":                   "NodeStatus",
		"customresourcedefinitions":    "CustomResourceDefinition",
		"customresourcedefinitionlist": "CustomResourceDefinitionList",
	}
	if v, isUnusual := unusualResourceToKind[r]; isUnusual {
		return v
	}
	caser := cases.Title(language.Und)
	k := caser.String(r)
	switch {
	case strings.HasSuffix(k, "ies"):
		return strings.TrimSuffix(k, "ies") + "y"
	case strings.HasSuffix(k, "es"):
		return strings.TrimSuffix(k, "es")
	case strings.HasSuffix(k, "s"):
		return strings.TrimSuffix(k, "s")
	}
	return k
}
This function, used in `func (f *Factory) Create(req *request.RequestInfo) http.Handler`, may turn CRD resource names into unrecognized words; for example, `ciliumnodes` turns into `Ciliumnod`. Is there a way to append to `unusualResourceToKind` dynamically?
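One possible shape for a dynamically extensible table is sketched below. This is not KubeEdge code: `kindRegistry`, `Register`, and `guessKind` are hypothetical names, and `guessKind` uses plain ASCII capitalization instead of `cases.Title` so that the sketch needs only the standard library. The suffix heuristic itself follows the function quoted above.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// kindRegistry is a hypothetical override table that can be extended at
// runtime, for example when a CRD is discovered.
type kindRegistry struct {
	mu        sync.RWMutex
	overrides map[string]string
}

func newKindRegistry() *kindRegistry {
	return &kindRegistry{overrides: map[string]string{
		"endpoints": "Endpoints",
		"nodes":     "Node",
	}}
}

// Register adds or replaces the kind recorded for a plural resource name.
func (k *kindRegistry) Register(resource, kind string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.overrides[resource] = kind
}

// ResourceToKind consults the overrides first and falls back to guessing.
func (k *kindRegistry) ResourceToKind(r string) string {
	k.mu.RLock()
	v, ok := k.overrides[r]
	k.mu.RUnlock()
	if ok {
		return v
	}
	return guessKind(r)
}

// guessKind applies the same suffix heuristic as the quoted function:
// capitalize the first letter, then strip "ies", "es", or "s".
func guessKind(r string) string {
	if r == "" {
		return r
	}
	t := strings.ToUpper(r[:1]) + r[1:]
	switch {
	case strings.HasSuffix(t, "ies"):
		return strings.TrimSuffix(t, "ies") + "y"
	case strings.HasSuffix(t, "es"):
		return strings.TrimSuffix(t, "es")
	case strings.HasSuffix(t, "s"):
		return strings.TrimSuffix(t, "s")
	}
	return t
}

func main() {
	reg := newKindRegistry()
	fmt.Println(reg.ResourceToKind("ciliumnodes")) // heuristic result
	reg.Register("ciliumnodes", "CiliumNode")
	fmt.Println(reg.ResourceToKind("ciliumnodes")) // registered override
}
```

The "es" rule is what strips too much from `ciliumnodes`; an override entry avoids the heuristic for that resource entirely.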
edgecore-panic-when-receive-post.log cilium-agent-post.pcap.txt
An empty `options.FieldValidation` makes `validationDirective` equal to `metav1.FieldValidationWarn`, which leads to `decodeSerializer = s.StrictSerializer`, which is `nil`.
A `nil` `decodeSerializer` produces a `nil` `decoder`, which panics when called.
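The failure mode can be reproduced in a few lines of plain Go. The sketch below is an assumption-laden model, not the apiserver code: `decoder`, `serializerInfo`, `pickDecoder`, and `decodeUnchecked` are hypothetical names, and the directive strings "Ignore" and "Warn" merely stand in for the field validation values described above.

```go
package main

import (
	"errors"
	"fmt"
)

// decoder is a minimal stand-in for a serializer's decode side.
type decoder interface {
	Decode(data []byte) (string, error)
}

type lenientDecoder struct{}

func (lenientDecoder) Decode(data []byte) (string, error) {
	return "lenient:" + string(data), nil
}

// serializerInfo mimics a struct with a default serializer and an optional
// strict one. StrictSerializer is nil when nobody populated it.
type serializerInfo struct {
	Serializer       decoder
	StrictSerializer decoder
}

// decodeUnchecked uses the strict serializer without checking it, which is
// the pattern that panics; the panic is recovered here only so the demo
// can report it.
func decodeUnchecked(info serializerInfo, data []byte) (out string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic: %v", r)
		}
	}()
	return info.StrictSerializer.Decode(data)
}

// pickDecoder treats an empty directive as "Warn", which selects the strict
// serializer, and returns an error instead of a nil decoder.
func pickDecoder(info serializerInfo, fieldValidation string) (decoder, error) {
	if fieldValidation == "" {
		fieldValidation = "Warn"
	}
	if fieldValidation == "Ignore" {
		return info.Serializer, nil
	}
	if info.StrictSerializer == nil {
		return nil, errors.New("strict serializer is not configured")
	}
	return info.StrictSerializer, nil
}

func main() {
	info := serializerInfo{Serializer: lenientDecoder{}}

	_, err := decodeUnchecked(info, []byte("{}"))
	fmt.Println("unchecked:", err)

	_, err = pickDecoder(info, "")
	fmt.Println("checked:", err)

	info.StrictSerializer = lenientDecoder{} // populated, as a fix would do
	d, err := pickDecoder(info, "")
	fmt.Println("after populating:", d != nil, err)
}
```

Populating the strict field removes the nil at its source, whereas a guard like the one in `pickDecoder` only turns the panic into an error.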
Fix to the panic issue above:
diff --git a/edge/pkg/metamanager/metaserver/kubernetes/serializer/serializer.go b/edge/pkg/metamanager/metaserver/kubernetes/serializer/serializer.go
index 62c9e83ac..e06570de3 100644
--- a/edge/pkg/metamanager/metaserver/kubernetes/serializer/serializer.go
+++ b/edge/pkg/metamanager/metaserver/kubernetes/serializer/serializer.go
@@ -31,6 +31,7 @@ func (f WithoutConversionCodecFactory) SupportedMediaTypes() []runtime.Serialize
EncodesAsText: true,
Serializer: json.NewSerializerWithOptions(json.DefaultMetaFactory, f.creator, f.typer, json.SerializerOptions{Pretty: false}),
PrettySerializer: json.NewSerializerWithOptions(json.DefaultMetaFactory, f.creator, f.typer, json.SerializerOptions{Pretty: true}),
+ StrictSerializer: json.NewSerializerWithOptions(json.DefaultMetaFactory, f.creator, f.typer, json.SerializerOptions{Strict: true}),
StreamSerializer: &runtime.StreamSerializerInfo{
EncodesAsText: true,
Serializer: json.NewSerializerWithOptions(json.DefaultMetaFactory, f.creator, f.typer, json.SerializerOptions{Pretty: false}),
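The failure mode described above can be reproduced in isolation. The sketch below uses simplified stand-in types (decoder, serializerInfo, pickDecoder and safeDecode are all hypothetical names, not the apimachinery API) to show that selecting an unset strict serializer yields a nil interface whose method call panics, and that populating the field avoids it.

```go
package main

import (
	"errors"
	"fmt"
)

// decoder and serializerInfo are simplified stand-ins for the apimachinery
// types; they are hypothetical and exist only for this illustration.
type decoder interface {
	Decode(data []byte) (string, error)
}

type plainDecoder struct{}

func (plainDecoder) Decode(data []byte) (string, error) {
	if len(data) == 0 {
		return "", errors.New("empty body")
	}
	return string(data), nil
}

type serializerInfo struct {
	Serializer       decoder
	StrictSerializer decoder // stays nil when the factory does not populate it
}

// pickDecoder mirrors the selection described above: the "Warn" directive
// (and, in this sketch, "Strict") selects the strict serializer.
func pickDecoder(s serializerInfo, directive string) decoder {
	if directive == "Warn" || directive == "Strict" {
		return s.StrictSerializer
	}
	return s.Serializer
}

// safeDecode turns the nil-interface panic into an error so it can be printed.
func safeDecode(d decoder, data []byte) (out string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("decode panicked: %v", r)
		}
	}()
	return d.Decode(data)
}

func main() {
	broken := serializerInfo{Serializer: plainDecoder{}}
	_, err := safeDecode(pickDecoder(broken, "Warn"), []byte(`{}`))
	fmt.Println("without StrictSerializer:", err)

	fixed := serializerInfo{Serializer: plainDecoder{}, StrictSerializer: plainDecoder{}}
	out, err := safeDecode(pickDecoder(fixed, "Warn"), []byte(`{}`))
	fmt.Println("with StrictSerializer:", out, err)
}
```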
Patch for UnsafeResourceToKind. Things still seem to work without this patch, though if you print Kind in edgecore, you will see Ciliumnod.
diff --git a/pkg/metaserver/util/util.go b/pkg/metaserver/util/util.go
index f549faf0d..266e1ed82 100644
--- a/pkg/metaserver/util/util.go
+++ b/pkg/metaserver/util/util.go
@@ -57,6 +57,7 @@ func UnsafeResourceToKind(r string) string {
"nodestatus": "NodeStatus",
"customresourcedefinitions": "CustomResourceDefinition",
"customresourcedefinitionlist": "CustomResourceDefinitionList",
+ "ciliumnodes": "CiliumNode",
}
if v, isUnusual := unusualResourceToKind[r]; isUnusual {
return v
@@ -84,6 +85,7 @@ func UnsafeKindToResource(k string) string {
"NodeStatus": "nodestatus",
"CustomResourceDefinition": "customresourcedefinitions",
"CustomResourceDefinitionList": "customresourcedefinitionlist",
+ "CiliumNode": "ciliumnodes",
}
if v, isUnusual := unusualKindToResource[k]; isUnusual {
return v
Fix to the panic issue above:
Thank you for your contribution. Could you please submit a PR to resolve this issue?
Patch for UnsafeResourceToKind. Things still seem to work without this patch, though if you print Kind in edgecore, you will see Ciliumnod.
Writing code in KubeEdge that adapts to cilium or other cni does not appear to be very elegant. Perhaps we could solve this issue using configuration or other dynamic ways.
One more thing, if we add support for healthz, readyz and livez like below, it is fine to handle 127.0.0.1:10550/readyz. But when it comes to 127.0.0.1:10550/readyz?verbose, edgecore is still handling it like 127.0.0.1:10550/readyz. This behavior is different from k8s document's description: https://kubernetes.io/docs/reference/using-api/health-checks/#api-endpoints-for-health. What's your opinion on this?
diff --git a/pkg/util/pass-through/pass_through.go b/pkg/util/pass-through/pass_through.go
index 225512546..c2e2dbd1b 100644
--- a/pkg/util/pass-through/pass_through.go
+++ b/pkg/util/pass-through/pass_through.go
@@ -4,10 +4,16 @@ type passRequest string
const (
versionRequest passRequest = "/version::get"
+ healthRequest passRequest = "/healthz::get"
+ liveRequest passRequest = "/livez::get"
+ readyRequest passRequest = "/readyz::get"
)
var passThroughMap = map[passRequest]bool{
versionRequest: true,
+ healthRequest: true,
+ liveRequest: true,
+ readyRequest: true,
}
// IsPassThroughPath determining whether the uri can be passed through
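To make the readyz?verbose concern concrete, here is a small sketch of why a key of the form "path::verb" cannot distinguish the two requests. The helper names passKey and wantsVerbose are hypothetical, and how edgecore actually derives its lookup key is an assumption; the sketch only relies on the key shape visible in the constants above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// passKey builds a lookup key in the "<path>::<verb>" shape used by the
// constants in the diff above. The query string is not part of the path,
// so it never reaches the key.
func passKey(rawURL, verb string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	return u.Path + "::" + strings.ToLower(verb), nil
}

// wantsVerbose reports whether the request carries a "verbose" query
// parameter, which a handler would need to check separately.
func wantsVerbose(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	_, ok := u.Query()["verbose"]
	return ok, nil
}

func main() {
	for _, raw := range []string{
		"http://127.0.0.1:10550/readyz",
		"http://127.0.0.1:10550/readyz?verbose",
	} {
		key, _ := passKey(raw, "GET")
		verbose, _ := wantsVerbose(raw)
		fmt.Printf("%s -> key=%q verbose=%v\n", raw, key, verbose)
	}
}
```

Both URLs map to the key "/readyz::get"; only the separate query check tells them apart, so supporting verbose output would presumably mean forwarding or inspecting the query in addition to the pass-through lookup.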
i am trying to figure out this patch. this patch is NOT required, right? w/o this patch, are there any specific logs or warnings that we need to take care of? could you provide more specifics?
Writing code in KubeEdge that adapts to cilium or other cni does not appear to be very elegant. Perhaps we could solve this issue using configuration or other dynamic ways.
agree. mainline should not be dependent on specific CNI implementations.
But when it comes to 127.0.0.1:10550/readyz?verbose, edgecore is still handling it like 127.0.0.1:10550/readyz. This behavior is different from k8s document's description: https://kubernetes.io/docs/reference/using-api/health-checks/#api-endpoints-for-health.
right, i came to this concern as well. if anything (not sure whether cilium does or will in the future) requires verbose behavior, that would not be supported.
at least, we would want to create a dedicated issue to track this? what do you think? @Shelley-BaoYue @ZhengXinwei-F
@Windrow14 friendly ping on #4844 (comment)
either @Shelley-BaoYue or @ZhengXinwei-F could you reopen this issue?
My thoughts are,
- #4844 (comment)
- Concrete procedure and documentation on how to enable Cilium with KubeEdge
- Discussion of what more can go into the generic part of KubeEdge and what stays a Cilium-specific procedure. (If a fix or configuration is generic for KubeEdge, it should be integrated into KubeEdge; otherwise we can have a special operation and documentation with configuration files to enable Cilium.)
what do you think?
Ok, I have reopened this issue. :-)
this patch is NOT required, right?
I think so, I don't see any trouble without this patch. The Kind string does not even seem to be printed by the existing loggers; we need to add extra log statements to check it. If it is only for internal use, then as long as it stays consistent, the functions that depend on it should work properly.
@Shelley-BaoYue can you reopen this? bot systematically closes once corresponding issue closed.
@Windrow14 i would suggest since this is meta-ticket for tracking, we would want to create dedicated issue for specific PR.