kubeedge/kubeedge

KubeEdge edgecore supports CNI Cilium?

fujitatomoya opened this issue · 40 comments

What happened and what you expected to happen:

Support for third-party CNI implementations is listed on the roadmap (Roadmap 2021-H1), but we run into several problems as soon as Cilium is enabled on edgecore.

How to reproduce it (as minimally and precisely as possible):

After KubeEdge is enabled, install Cilium.

Anything else we need to know?:

The following issues are related to this question.

Let me elaborate on our team's investigation of using Cilium with KubeEdge so far.

Here is our process for setting up the cluster. (The cloudcore and edgecore in use include some workarounds for the problems we encountered; I'll show the details as well.)

  1. Create the cluster on the cloud node with kubeadm init --node-name pc-windrow. Copy /etc/kubernetes/admin.conf to ~/.kube/config.
  2. Untaint the master node with kubectl taint node pc-windrow node-role.kubernetes.io/master:NoSchedule-.
  3. Install Cilium into the cluster with cilium --encryption wireguard install.
  4. Add node-role.kubernetes.io/control-plane: '' and tolerations with kubectl edit deployment -n kube-system cilium-operator, so that cilium-operator is always deployed on the cloud node.
  5. Duplicate Cilium's daemonset. One copy is still for nodes using kubelet, including the master node. The other is adjusted for nodes using edgecore: --k8s-api-server=127.0.0.1:10550 is added to cilium-agent's arguments. The two daemonsets differ only in their names and in that argument.
  6. Initialize KubeEdge on the master node with sudo keadm init --advertise-address="xxx.xxx.xxx.xxx" --profile version=v1.12.1 --kube-config=/home/windrow/.kube/config.
  7. Add cloudcore with kubectl edit clusterRolebinding cilium to give it permission to access Cilium's resources.
  8. Enable dynamiccontroller by editing the configmap with kubectl edit cm -n kubeedge cloudcore, then delete the old pod with kubectl delete pod -n kubeedge cloudcore-5876c76687-52xr5.
  9. Get the token and join the edge node: sudo keadm gettoken, then sudo keadm join --cloudcore-ipport=xxx.xxx.xxx.xxx:10000 --kubeedge-version=v1.12.1 --cgroupdriver=systemd --token ********
  10. Edit /etc/kubeedge/config/edgecore.yaml to enable servicebus and metamanager.
  11. Replace /usr/local/bin/edgecore with an edgecore built locally from source with the workarounds applied, then run sudo systemctl daemon-reload and sudo systemctl restart edgecore.
  12. Create a fake secret: kubectl create secret generic cilium-clustermesh -n kube-system. This is a workaround for issue 4817.

Besides issue 4817 and issue 4819 mentioned by Fujita-san, we also found that edgecore does not handle GET /version and GET /healthz, so I hard-coded a 200 OK response for these requests in edgecore.
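For readers who want to see the shape of such a stub, here is a minimal standalone sketch in Go using only net/http. It is not the actual edgecore change: withStubbedProbes and statusFor are hypothetical names, and the fallback handler merely stands in for whatever handler chain edgecore really uses.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// withStubbedProbes (hypothetical name) wraps next and answers
// GET /version and GET /healthz with a bare 200 OK. Every other
// request is passed through to next unchanged.
func withStubbedProbes(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodGet &&
			(r.URL.Path == "/version" || r.URL.Path == "/healthz") {
			w.WriteHeader(http.StatusOK)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one in-memory request through the wrapped handler
// and returns the status code, so no socket is needed.
func statusFor(method, path string) int {
	// Stand-in for the real handler chain: it rejects everything.
	fallback := http.NotFoundHandler()
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(method, path, nil)
	withStubbedProbes(fallback).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	for _, p := range []string{"/version", "/healthz", "/api/v1/pods"} {
		fmt.Println(p, statusFor(http.MethodGet, p))
	}
}
```

A stub like this only keeps a client's startup probes happy; it returns no version payload, so a client that parses the body of /version would still need a real response.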

The current state is that the Cilium applications are running normally. According to their logs, the Cilium agents on different nodes can communicate with each other and decide the CIDR for each node. But on edge nodes, newly deployed pods get addresses from 169.254.32.0/24, which is the default address pool of docker. So I think the CNI plugin is not taking effect.

10. Edit /etc/kubeedge/config/edgecore.yaml to enable servicebus and metamanager.

In this step, networkPluginName: cni should also be added into the file, under edged.

networkPluginName: cni does not take effect in v1.12.1. v1.12.2 fixed it, you can try it.

My build is based on the head of origin/release-1.12: commit 3de324ec27e748a70a01e8520aec559516a5854b (HEAD -> release-1.12, tag: v1.12.2, origin/release-1.12). So the fix should already be included? Let me do some verification.

@Shelley-BaoYue Thanks for your advice!
By the way, is there a way to prevent keadm from pulling a new edgecore and copying it to /usr/local/bin during keadm join?

The problem is still related to namespaces. As you can see in the log, the resource Cilium wants to fetch is a CRD in the "default" namespace, but message_dispatcher.go assumes there is a namespace named "null".

E0713 13:06:54.931545       1 message_dispatcher.go:308] "MessageRoute.Resource" err="namespaces \"null\" not found" resource="node/pc-k8s-192-168-0-121/null/ciliumidentity/24958"
E0713 13:06:54.931612       1 message_dispatcher.go:309] "Failed to create objectSync" err="namespaces \"null\" not found" objectSyncName="pc-k8s-192-168-0-121.15fe34e6-b5d5-4ec3-bc1a-47acffdd146b" resourceNamespace="null" resourceName="24958"
E0713 13:06:54.989269       1 message_dispatcher.go:308] "MessageRoute.Resource" err="namespaces \"null\" not found" resource="node/pc-k8s-192-168-0-121/null/ciliumidentity/45429"
E0713 13:06:54.989361       1 message_dispatcher.go:309] "Failed to create objectSync" err="namespaces \"null\" not found" objectSyncName="pc-k8s-192-168-0-121.1b1c443b-ac3b-4106-9854-bc758675bf24" resourceNamespace="null" resourceName="45429"
E0713 13:06:55.046585       1 message_dispatcher.go:308] "MessageRoute.Resource" err="namespaces \"null\" not found" resource="node/pc-k8s-192-168-0-121/null/ciliumidentity/51110"
E0713 13:06:55.046644       1 message_dispatcher.go:309] "Failed to create objectSync" err="namespaces \"null\" not found" objectSyncName="pc-k8s-192-168-0-121.88fa747d-d459-4698-9572-78ee4d33a380" resourceNamespace="null" resourceName="51110"

My workaround for the above problem is (please ignore the added log statements):

diff --git a/cloud/pkg/cloudhub/dispatcher/message_dispatcher.go b/cloud/pkg/cloudhub/dispatcher/message_dispatcher.go
index f692563eb..297bb1f9a 100644
--- a/cloud/pkg/cloudhub/dispatcher/message_dispatcher.go
+++ b/cloud/pkg/cloudhub/dispatcher/message_dispatcher.go
@@ -265,6 +265,16 @@ func (md *messageDispatcher) enqueueAckMessage(nodeID string, msg *beehivemodel.
                return
        }
 
+       resourceType, _ := messagelayer.GetResourceType(*msg)
+       if resourceType == beehivemodel.ResourceTypeNamespace {
+               resourceNamespace = resourceName
+       }
+       klog.Errorf("resourceNamespace original: %s", resourceNamespace)
+       if resourceNamespace == "null" {
+               resourceNamespace = "default"
+       }
+       klog.Errorf("resourceNamespace: %s", resourceNamespace)
+
        objectSyncName := synccontroller.BuildObjectSyncName(nodeID, resourceUID)
        objectSync, err := md.objectSyncLister.ObjectSyncs(resourceNamespace).Get(objectSyncName)
 
@@ -295,6 +305,7 @@ func (md *messageDispatcher) enqueueAckMessage(nodeID string, msg *beehivemodel.
                        ObjectSyncs(resourceNamespace).
                        Create(context.Background(), objectSync, metav1.CreateOptions{})
                if err != nil {
+                       klog.ErrorS(err, "MessageRoute.Resource", "resource", msg.GetResource())
                        klog.ErrorS(err, "Failed to create objectSync",
                                "objectSyncName", objectSyncName,
                                "resourceNamespace", resourceNamespace,
diff --git a/cloud/pkg/cloudhub/session/node_session.go b/cloud/pkg/cloudhub/session/node_session.go
index 045410185..767c8b884 100644
--- a/cloud/pkg/cloudhub/session/node_session.go
+++ b/cloud/pkg/cloudhub/session/node_session.go
@@ -386,6 +386,15 @@ func (ns *NodeSession) saveSuccessPoint(msg *beehivemodel.Message) {
                        return
                }
 
+               if resourceType == beehivemodel.ResourceTypeNamespace {
+                       resourceNamespace = resourceName
+               }
+               klog.Errorf("resourceNamespace original: %s", resourceNamespace)
+               if resourceNamespace == "null" {
+                       resourceNamespace = "default"
+               }
+               klog.Errorf("resourceNamespace: %s", resourceNamespace)
+
                objectSyncName := synccontroller.BuildObjectSyncName(ns.nodeID, resourceUID)
 
                if msg.GetOperation() == beehivemodel.DeleteOperation {
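The two rules applied by the workaround above can be sketched in isolation as follows. This is an illustration only, not KubeEdge code: splitResource, normalizeNamespace and namespaceFor are hypothetical helpers, the five-segment resource layout is inferred from the log lines earlier in this comment, and the constant typeNamespace is assumed to match beehivemodel.ResourceTypeNamespace.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	nullNamespace    = "null"      // placeholder seen in the logs above
	defaultNamespace = "default"   // fallback chosen by the workaround
	typeNamespace    = "namespace" // assumed value of beehivemodel.ResourceTypeNamespace
)

// splitResource parses "node/<nodeID>/<namespace>/<type>/<name>",
// the layout inferred from the log output (an assumption).
func splitResource(resource string) (nodeID, namespace, resourceType, name string, err error) {
	parts := strings.Split(resource, "/")
	if len(parts) != 5 || parts[0] != "node" {
		return "", "", "", "", fmt.Errorf("unexpected resource %q", resource)
	}
	return parts[1], parts[2], parts[3], parts[4], nil
}

// normalizeNamespace mirrors the workaround in the diff:
//  1. for a Namespace object, the namespace is the object's own name;
//  2. the placeholder "null" is replaced with "default".
func normalizeNamespace(resourceType, namespace, name string) string {
	if resourceType == typeNamespace {
		namespace = name
	}
	if namespace == nullNamespace {
		namespace = defaultNamespace
	}
	return namespace
}

// namespaceFor combines both steps; it returns "" for malformed input.
func namespaceFor(resource string) string {
	_, ns, typ, name, err := splitResource(resource)
	if err != nil {
		return ""
	}
	return normalizeNamespace(typ, ns, name)
}

func main() {
	fmt.Println(namespaceFor("node/pc-k8s-192-168-0-121/null/ciliumidentity/24958"))
}
```

Mapping every "null" namespace to "default" is the workaround's own choice; whether "default" is the right home for the ObjectSync of a resource without a namespace is a separate design question that this sketch does not settle.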

However, with this workaround, the connection between cloudcore and edgecore becomes unstable.

Cloud side log:

W0713 14:15:06.532295       1 upstream.go:217] parse message: 68988ccb-ccb8-4461-82b2-6684a901e7d1 resource type with error, message resource: node/pc-k8s-192-168-0-121, err: resource type not found
I0713 14:15:06.532335       1 message_handler.go:122] edge node pc-k8s-192-168-0-121 for project e632aba927ea4ac2b575ec1603d56f10 connected
I0713 14:15:06.532518       1 node_session.go:136] Start session for edge node pc-k8s-192-168-0-121
I0713 14:15:06.534728       1 application.go:446] [metaserver/ApplicationCenter] get a Application (NodeName=pc-k8s-192-168-0-121;Key=/networking.k8s.io/v1/networkpolicies/null/null;Verb=watch;Status=Completed;Reason=failed to access cloud Application center: timeout to get response for message 8450c197-d20b-4bc4-b638-8071e85612c5)
I0713 14:15:06.534867       1 application.go:455] [metaserver/applicationCenter]successfully to process Application((NodeName=pc-k8s-192-168-0-121;Key=/networking.k8s.io/v1/networkpolicies/null/null;Verb=watch;Status=Approved;Reason=failed to access cloud Application center: timeout to get response for message 8450c197-d20b-4bc4-b638-8071e85612c5))
I0713 14:15:06.535380       1 upstream.go:89] Dispatch message: 2a585f02-cbb8-4d82-ba40-bd6c163294bf
I0713 14:15:06.535424       1 upstream.go:96] Message: 2a585f02-cbb8-4d82-ba40-bd6c163294bf, resource type is: membership/detail
I0713 14:15:06.535469       1 application.go:446] [metaserver/ApplicationCenter] get a Application (NodeName=pc-k8s-192-168-0-121;Key=/cilium.io/v2/ciliumnodes/null/pc-k8s-192-168-0-121;Verb=update;Status=Completed;Reason=failed to access cloud Application center: timeout to get response for message fcac3af0-ff3a-4498-ae48-c998882e8bf6)
E0713 14:15:06.561567       1 application.go:451] [metaserver/applicationCenter]failed to process Application((NodeName=pc-k8s-192-168-0-121;Key=/cilium.io/v2/ciliumnodes/null/pc-k8s-192-168-0-121;Verb=update;Status=Rejected;Reason=failed to access cloud Application center: timeout to get response for message fcac3af0-ff3a-4498-ae48-c998882e8bf6)), Operation cannot be fulfilled on ciliumnodes.cilium.io "pc-k8s-192-168-0-121": the object has been modified; please apply your changes to the latest version and try again
I0713 14:15:06.561795       1 application.go:446] [metaserver/ApplicationCenter] get a Application (NodeName=pc-k8s-192-168-0-121;Key=/core/v1/namespaces/null/null;Verb=watch;Status=Completed;Reason=failed to access cloud Application center: timeout to get response for message efae166e-d2bd-4986-965a-59776584cf40)
I0713 14:15:06.562148       1 application.go:455] [metaserver/applicationCenter]successfully to process Application((NodeName=pc-k8s-192-168-0-121;Key=/core/v1/namespaces/null/null;Verb=watch;Status=Approved;Reason=failed to access cloud Application center: timeout to get response for message efae166e-d2bd-4986-965a-59776584cf40))
W0713 14:15:06.864566       1 node_session.go:284] node pc-k8s-192-168-0-121 is deleted, message for node will be cleaned up
I0713 14:15:06.864628       1 message_handler.go:139] edge node pc-k8s-192-168-0-121 for project e632aba927ea4ac2b575ec1603d56f10 disConnected
E0713 14:15:06.864670       1 ws.go:107] failed to read message, error: read tcp xxx.xxx.xxx.xxx:10000->xxx.xxx.yyy.yyy:33914: use of closed network connection
W0713 14:15:06.864685       1 upstream.go:217] parse message: 0bde0350-4994-436d-9a2a-ddc89a7e428a resource type with error, message resource: node/pc-k8s-192-168-0-121, err: resource type not found
E0713 14:15:06.864696       1 message_handler.go:166] projectID e632aba927ea4ac2b575ec1603d56f10 node pc-k8s-192-168-0-121 read message err
E0713 14:15:06.864704       1 message_handler.go:170] session not found for node pc-k8s-192-168-0-121
W0713 14:15:06.866193       1 message_dispatcher.go:416] message pool for edge node pc-k8s-192-168-0-121 not found and created now

Edge side log:

7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.186916   49983 websocket.go:51] Websocket start to connect Access
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.196177   49983 ws.go:46] dial wss://xxx.xxx.xxx.xxx:10000/e632aba927ea4ac2b575ec1603d56f10/pc-k8s-192-168-0-121/events successfully
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.196258   49983 websocket.go:93] Websocket connect to cloud access successful
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.196370   49983 process.go:461] node connection event occur: cloud_connected
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.196410   49983 process.go:461] node connection event occur: cloud_connected
7月 13 14:09:23 edge121 edgecore[49983]: W0713 14:09:23.196425   49983 eventbus.go:168] Action not found
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.196474   49983 process.go:299] DeviceTwin receive msg
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.196531   49983 process.go:68] Send msg to the CommModule module in twin
7月 13 14:09:23 edge121 edgecore[49983]: W0713 14:09:23.298635   49983 context_channel.go:164] Get bad anonName:d12873a3-efe7-46fa-a3bb-2703afb9cb58 when sendresp message, do nothing
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.363680   49983 trace.go:205] Trace[1940341198]: "Update" url:/apis/cilium.io/v2/ciliumnodes/pc-k8s-192-168-0-121,user-agent:cilium-agent/1.12.0 9447cd1 2022-07-19T12:22:00+02:00 go version go1.18.4 linux/amd64,audit-id:,client:127.0.0.1,accept:application/json, */*,protocol:HTTP/1.1 (13-Jul-2023 14:09:19.505) (total time: 3858ms):
7月 13 14:09:23 edge121 edgecore[49983]: Trace[1940341198]: [3.858596122s] [3.858596122s] END
7月 13 14:09:23 edge121 edgecore[49983]: E0713 14:09:23.556967   49983 ws.go:107] failed to read message, error: websocket: close 1006 (abnormal closure): unexpected EOF
7月 13 14:09:23 edge121 edgecore[49983]: E0713 14:09:23.598141   49983 process.go:161] failed to send message to cloud: failed to send message, error: use of closed network connection
7月 13 14:09:23 edge121 edgecore[49983]: E0713 14:09:23.598264   49983 process.go:112] websocket read error: the fifo is broken
7月 13 14:09:23 edge121 edgecore[49983]: W0713 14:09:23.598293   49983 edgehub.go:126] connection is broken, will reconnect after 30s
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.598322   49983 process.go:299] DeviceTwin receive msg
7月 13 14:09:23 edge121 edgecore[49983]: W0713 14:09:23.598350   49983 eventbus.go:168] Action not found
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.598369   49983 process.go:68] Send msg to the CommModule module in twin
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.598514   49983 process.go:461] node connection event occur: cloud_disconnected
7月 13 14:09:23 edge121 edgecore[49983]: I0713 14:09:23.598559   49983 process.go:461] node connection event occur: cloud_disconnected
7月 13 14:09:38 edge121 edgecore[49983]: E0713 14:09:38.197144   49983 process.go:183] websocket write error: failed to send message, error: use of closed network connection
7月 13 14:09:43 edge121 edgecore[49983]: E0713 14:09:43.367243   49983 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"failed to access cloud Application center: timeout to get response for message 726de9ad-3070-47f7-b0d5-ca14a3567952"}: failed to access cloud Application center: timeout to get response for message 726de9ad-3070-47f7-b0d5-ca14a3567952
7月 13 14:09:43 edge121 edgecore[49983]: I0713 14:09:43.367458   49983 trace.go:205] Trace[2121639627]: "Update" url:/apis/cilium.io/v2/ciliumnodes/pc-k8s-192-168-0-121,user-agent:cilium-agent/1.12.0 9447cd1 2022-07-19T12:22:00+02:00 go version go1.18.4 linux/amd64,audit-id:,client:127.0.0.1,accept:application/json, */*,protocol:HTTP/1.1 (13-Jul-2023 14:09:33.365) (total time: 10001ms):
7月 13 14:09:43 edge121 edgecore[49983]: Trace[2121639627]: [10.001876031s] [10.001876031s] END

Also, please check this comment in issue 4819: #4819 (comment).
Although that ticket is closed, the MR related to it does not cover the problem mentioned in the ticket. We want to reopen that ticket. Please review the additional patch shown in that comment.

However, with this workaround, the connection between cloudcore and edgecore becomes unstable.

Maybe the connection instability is caused not by the workaround but by the restart of edgecore.

W0714 11:15:42.218567       1 message_dispatcher.go:417] message pool for edge node pc-k8s-192-168-0-121 not found and created now
E0714 11:15:47.212098       1 node_session.go:194] syncAckMessage err: use of closed network connection
W0714 11:16:12.244028       1 upstream.go:217] parse message: 2a148267-1355-46c5-b482-2d699c3c8b3a resource type with error, message resource: node/pc-k8s-192-168-0-121, err: resource type not found
I0714 11:16:12.244105       1 message_handler.go:122] edge node pc-k8s-192-168-0-121 for project e632aba927ea4ac2b575ec1603d56f10 connected
I0714 11:16:12.244234       1 node_session.go:136] Start session for edge node pc-k8s-192-168-0-121
I0714 11:16:12.246083       1 upstream.go:89] Dispatch message: 43dc2709-fdaa-4953-9154-c97ee8edae3c
I0714 11:16:12.246125       1 upstream.go:96] Message: 43dc2709-fdaa-4953-9154-c97ee8edae3c, resource type is: membership/detail
W0714 11:16:13.373796       1 node_session.go:284] node pc-k8s-192-168-0-121 is deleted, message for node will be cleaned up
E0714 11:16:13.373865       1 node_session.go:194] syncAckMessage err: AckMessageQueue for node pc-k8s-192-168-0-121 has shutdown
E0714 11:16:13.373875       1 ws.go:107] failed to read message, error: read tcp 43.82.125.156:10000->43.82.111.25:38884: use of closed network connection
I0714 11:16:13.373883       1 message_handler.go:139] edge node pc-k8s-192-168-0-121 for project e632aba927ea4ac2b575ec1603d56f10 disConnected
E0714 11:16:13.373893       1 message_handler.go:166] projectID e632aba927ea4ac2b575ec1603d56f10 node pc-k8s-192-168-0-121 read message err
E0714 11:16:13.373900       1 message_handler.go:170] session not found for node pc-k8s-192-168-0-121
W0714 11:16:13.373929       1 upstream.go:217] parse message: 2e41c8de-d5c7-40db-816c-95e50f5631ed resource type with error, message resource: node/pc-k8s-192-168-0-121, err: resource type not found
W0714 11:16:13.442387       1 message_dispatcher.go:417] message pool for edge node pc-k8s-192-168-0-121 not found and created now
I0714 11:16:47.065018       1 message_handler.go:122] edge node pc-k8s-192-168-0-121 for project e632aba927ea4ac2b575ec1603d56f10 connected
W0714 11:16:47.065047       1 upstream.go:217] parse message: 6c7356f1-8985-4294-a438-669645547d28 resource type with error, message resource: node/pc-k8s-192-168-0-121, err: resource type not found
I0714 11:16:47.065166       1 node_session.go:136] Start session for edge node pc-k8s-192-168-0-121
I0714 11:16:47.067045       1 upstream.go:89] Dispatch message: e0301055-a4dc-4cd5-b8fe-48ba4a8453f3
I0714 11:16:47.067096       1 upstream.go:96] Message: e0301055-a4dc-4cd5-b8fe-48ba4a8453f3, resource type is: membership/detail
I0714 11:16:47.067223       1 application.go:446] [metaserver/ApplicationCenter] get a Application (NodeName=pc-k8s-192-168-0-121;Key=/cilium.io/v2/ciliumendpoints/null/null;Verb=watch;Status=Completed;Reason=failed to access cloud Application center: timeout to get response for message 2321f6e0-3cba-45cf-9cd6-c0ac084fa339)
I0714 11:16:47.068370       1 application.go:455] [metaserver/applicationCenter]successfully to process Application((NodeName=pc-k8s-192-168-0-121;Key=/cilium.io/v2/ciliumendpoints/null/null;Verb=watch;Status=Approved;Reason=failed to access cloud Application center: timeout to get response for message 2321f6e0-3cba-45cf-9cd6-c0ac084fa339))
W0714 11:16:50.038050       1 node_session.go:284] node pc-k8s-192-168-0-121 is deleted, message for node will be cleaned up
I0714 11:16:50.038080       1 message_handler.go:139] edge node pc-k8s-192-168-0-121 for project e632aba927ea4ac2b575ec1603d56f10 disConnected
E0714 11:16:50.038120       1 ws.go:107] failed to read message, error: read tcp 43.82.125.156:10000->43.82.111.25:38888: use of closed network connection
W0714 11:16:50.038124       1 upstream.go:217] parse message: d09bb98e-4213-41ba-89f5-e2c5f0d14229 resource type with error, message resource: node/pc-k8s-192-168-0-121, err: resource type not found
E0714 11:16:50.038135       1 message_handler.go:166] projectID e632aba927ea4ac2b575ec1603d56f10 node pc-k8s-192-168-0-121 read message err
E0714 11:16:50.038146       1 message_handler.go:170] session not found for node pc-k8s-192-168-0-121
E0714 11:16:55.037746       1 node_session.go:194] syncAckMessage err: use of closed network connection
W0714 11:16:55.180870       1 message_dispatcher.go:417] message pool for edge node pc-k8s-192-168-0-121 not found and created now

It looks like cloudhub thinks a session is already closed, so it closes the websocket connection to that node (in SendAckMessage in node_session.go). Although the session is created again because of a new message from edgecore, the EOF error caused by the shutdown makes edgecore abandon that session. Then cloudhub considers the session dead again and wants to terminate it (SendNoAckMessage). This loop repeats again and again.

Let me update the current situation (2023.7.27):

  1. With a number of workarounds and a strictly ordered sequence of steps, we can make an edge node managed by KubeEdge work with Cilium. There are no obvious problems creating pods or pinging IP addresses between pods on different nodes.
  2. edgecore's journal still prints some errors about failing to get ciliumidentity, but no consequence has been seen so far.
  3. In-cluster DNS on the edge node needs EdgeMesh. However, on my setup it works only between two edgecore-managed nodes or locally on an edgecore-managed node, not between an edgecore-managed node and a kubelet-managed node.

Current steps for setting up the cluster:

  1. Prepare cloudcore's image and edgecore's binary.
  2. Create the cluster on cloud node.
    sudo kubeadm init --node-name pc-windrow
    mkdir ~/.kube
    sudo cp /etc/kubernetes/admin.conf ~/.kube/config
    sudo chown $(id -u):$(id -g) ~/.kube/config
    sudo mkdir /root/.kube
    sudo cp /etc/kubernetes/admin.conf /root/.kube/config
    
  3. Untaint the cloud node.
    kubectl taint node pc-windrow node-role.kubernetes.io/master:NoSchedule-
    
  4. Install Cilium.
    cilium --encryption wireguard install
    
  5. Make sure cilium-operator is always deployed on the cloud node by adding node-role.kubernetes.io/control-plane: '' and tolerations. cilium-operator.txt
    kubectl apply -f cilium-operator.yaml
    
  6. Duplicate Cilium's daemonset. One copy is still for nodes using kubelet, including the master node. The other is adjusted for nodes using edgecore: --k8s-api-server=127.0.0.1:10550 is added to cilium-agent's arguments. The two daemonsets differ only in their names and in that argument. cilium-kubelet.txt cilium-kubeedge.txt
    kubectl apply -f cilium-kubelet.yaml
    kubectl apply -f cilium-kubeedge.yaml
    
  7. Initialize KubeEdge on cloud node.
    sudo keadm init --advertise-address="xxx.xxx.xxx.xxx" --profile version=v1.12.2 --kube-config=/home/windrow/.kube/config
    
  8. Give cloudcore access to Cilium's resources by editing the clusterRolebinding. cilium-clusterrole.txt cilium-clusterrolebinding.txt
    kubectl apply -f cilium-clusterrole.yaml
    kubectl apply -f cilium-clusterrolebinding.yaml
    
  9. Enable dynamiccontroller by editing configmap. kubeedge-configmap-cloudcore.txt
    kubectl apply -f kubeedge-configmap-cloudcore.yaml
    kubectl delete pod -n kubeedge -l kubeedge=cloudcore
    
  10. Get token after cloudcore has restarted.
    sudo keadm gettoken
    
  11. Join the edge node. (Remember to stop and remove mosquitto's container before joining, because it occupies port 1883.)
    sudo keadm join --cloudcore=xxx.xxx.xxx.xxx:10000 --kubeedge-version=v1.12.2 --cgroupdriver=systemd --edgenode-name pc-k8s-192-168-0-121 --token e99dced4689534f29b1502754c7fa63c90834650285aa56e0079d7b794b1676b.eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2OTA1MTk4Nzh9.hMA-6WCnOVmEXwGgGCsE7Dc8sufUD_CJ6o-pIrdPl8c
    
  12. Edit /etc/kubeedge/config/edgecore.yaml. edgecore.txt
    diff --git a/edgecore.yaml b/edgecore.yaml
    index 82e5ece..c011e78 100644
    --- a/edgecore.yaml
    +++ b/edgecore.yaml
    @@ -55,6 +55,7 @@ modules:
         masterServiceNamespace: default
         maxPerPodContainerCount: 1
         minimumGCAge: 0s
    +    networkPluginName: cni
         networkPluginMTU: 1500
         nonMasqueradeCidr: 10.0.0.0/8
         podSandboxImage: kubeedge/pause:3.1
    @@ -69,6 +70,9 @@ modules:  <-- This block is for EdgeMesh, we can neglect it if not using EdgeMesh.
           address: 127.0.0.1
           cgroupDriver: systemd
           cgroupsPerQOS: true
    +      clusterDNS:
    +      - 169.254.96.16
    +      clusterDomain: cluster.local
           configMapAndSecretChangeDetectionStrategy: Get
           containerLogMaxFiles: 5
           containerLogMaxSize: 10Mi
    @@ -157,14 +161,14 @@ modules:
         contextSendModule: websocket
         enable: true
         metaServer:
    -      enable: false
    +      enable: true
           server: 127.0.0.1:10550
           tlsCaFile: /etc/kubeedge/ca/rootCA.crt
           tlsCertFile: /etc/kubeedge/certs/server.crt
           tlsPrivateKeyFile: /etc/kubeedge/certs/server.key
         remoteQueryTimeout: 60
       serviceBus:
    -    enable: false
    +    enable: true
         port: 9060
         server: 127.0.0.1
         timeout: 60
    
  13. Replace edgecore binary. (The binary at /usr/local/bin/edgecore is re-downloaded each time keadm join ... is used.)
    sudo systemctl stop edgecore
    sudo cp edgecore /usr/local/bin/edgecore
    sudo systemctl daemon-reload
    sudo systemctl start edgecore
    
  14. Create a fake secret. (A workaround for issue 4817.)
    kubectl create secret generic cilium-clustermesh -n kube-system
    
  15. Deploy EdgeMesh. (To check further, not required.)

Current code changes in kubeedge:

  1. Error when getting a namespace through metaserver.
  2. Error when getting version and healthz through metaserver.
  3. Error when getting a node through metaserver.

The version API through metaserver will be supported in the next version (v1.15) @ZhengXinwei-F, and you're welcome to contribute your code changes.

@Shelley-BaoYue Thanks for the information! These code changes are mostly ugly workarounds. I'm verifying feasibility so far.

Will getting a CRD, such as ciliumendpoint, through metaserver be supported some day? Currently, it is ignored in the switch statement of syncPod in edged.go.

We will consider it as a feature; it may take a while to implement. You are welcome to join the discussion at the community meeting and describe your requirements in detail.

I agree with @Shelley-BaoYue, we will consider it as a feature.

And here is the current version's interim solution for this issue.

In terms of code implementation, the present version of metaServer already supports getting CRs.
But I discovered that the request fails because cloudcore lacks the required permissions, as in the following example.
P.S. My Linux kernel version does not support Cilium, so I will use Calico as the example here.

curl 127.0.0.1:10550/apis/crd.projectcalico.org/v1/ipamblocks
{
  "apiVersion": "crd.projectcalico.org/v1",
  "items": [],
  "kind": "IpamblockList"
}

cloudcore logs:

E0731 17:16:38.321093 1 application.go:61] [metaserver/applicationCenter]failed to process Application((NodeName=node1;Key=/crd.projectcalico.org/v1/ipamblocks/null/null;Verb=list;Status=Rejected;Reason=get current list error: ipamblocks.crd.projectcalico.org is forbidden: User "system:serviceaccount:kubeedge:cloudcore" cannot list resource "ipamblocks" in API group "crd.projectcalico.org" at the cluster scope)), get current list error: ipamblocks.crd.projectcalico.org is forbidden: User "system:serviceaccount:kubeedge:cloudcore" cannot list resource "ipamblocks" in API group "crd.projectcalico.org" at the cluster scope

To grant cloudcore permission on the CRs, we can update the clusterrole called cloudcore as follows:

kubectl edit clusterrole cloudcore (P.S. you can also create a new clusterrole and clusterrolebinding)

- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipreservations
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  - caliconodestatuses
  verbs:
  - get
  - list
  - watch

then

curl 127.0.0.1:10550/apis/crd.projectcalico.org/v1/ipamblocks
{
  "apiVersion": "crd.projectcalico.org/v1",
  "items": [
    {
      "apiVersion": "crd.projectcalico.org/v1",
      "kind": "IPAMBlock",
      "metadata": {
        "annotations": {
          "projectcalico.org/metadata": "{\"creationTimestamp\":null}"
        },
        "creationTimestamp": "2023-07-29T02:50:32Z",
        "generation": 13,
        "managedFields": [
          {
            "apiVersion": "crd.projectcalico.org/v1",
            "fieldsType": "FieldsV1",
            "fieldsV1": {
              "f:metadata": {
                "f:annotations": {
.......

https://github.com/kubeedge/kubeedge/blob/0b49dc676d6ba61d0aeca24c97dfd5069435ac78/build/cloud/ha/01-ha-prepare.yaml#L27C7-L27C7

kubectl edit clusterrole cloudcore (ps, you can also create a new clusterrole and clusterrolebinding)

Yes, I also edited the clusterrole and clusterrolebinding in step 8 of #4844 (comment). Let me double-check.

https://kubernetes.io/docs/reference/access-authn-authz/rbac/#aggregated-clusterroles
Maybe the "Aggregated ClusterRoles" feature of K8s can help with this issue. We'll also take this approach into account in future development.

For example:

cloudcore clusterrole:

aggregationRule:
  clusterRoleSelectors:
  - matchLabels:
      rbac.authorization.k8s.io/aggregate-to-cloudcore: "true"

ClusterRoles for Cilium, Calico, and others:

metadata:
  labels:
    rbac.authorization.k8s.io/aggregate-to-cloudcore: "true"
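To make the label-based selection above concrete, here is a toy model in Go. It does not use the Kubernetes API: the clusterRole struct, the one-string-per-rule encoding, and the sample role names and rules are all hypothetical. It only illustrates the idea that an aggregated role collects the rules of every role whose labels match its selector.

```go
package main

import (
	"fmt"
	"sort"
)

// clusterRole is a simplified stand-in for a Kubernetes ClusterRole.
type clusterRole struct {
	name   string
	labels map[string]string
	rules  []string // simplified: "apiGroup/resource:verb"
}

// matches reports whether labels contain every key/value in selector.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// aggregate returns the sorted, de-duplicated union of the rules of
// all roles whose labels match the selector.
func aggregate(selector map[string]string, roles []clusterRole) []string {
	seen := map[string]bool{}
	var out []string
	for _, r := range roles {
		if !matches(selector, r.labels) {
			continue
		}
		for _, rule := range r.rules {
			if !seen[rule] {
				seen[rule] = true
				out = append(out, rule)
			}
		}
	}
	sort.Strings(out)
	return out
}

// demoRules aggregates three sample roles; only the two labeled ones
// contribute. Role names and rules are made up for illustration.
func demoRules() []string {
	label := map[string]string{"rbac.authorization.k8s.io/aggregate-to-cloudcore": "true"}
	roles := []clusterRole{
		{name: "cilium-for-cloudcore", labels: label, rules: []string{"cilium.io/ciliumnodes:get"}},
		{name: "calico-for-cloudcore", labels: label, rules: []string{"crd.projectcalico.org/ipamblocks:list"}},
		{name: "unrelated", labels: nil, rules: []string{"apps/deployments:delete"}},
	}
	return aggregate(label, roles)
}

func main() {
	for _, rule := range demoRules() {
		fmt.Println(rule)
	}
}
```

With this scheme, each CNI would ship its own labeled ClusterRole and the cloudcore role would not need to be edited by hand for every new plugin.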

Hi @fisherxu , excuse me, is there somewhere I can find a document for submodules such as dynamiccontroller?

I double-checked the patch I provided (https://github.com/kubeedge/kubeedge/files/12181548/20230727_kubeedge_cilium.txt) against the existing issue tickets and found that plenty of tickets already exist for these workarounds.

List of current blocking issues:
#2956, #4420, #4453, #4959, #5042.

#4904: this patch should also be related to this issue. cc @ZhengXinwei-F

@Windrow14 Could you please open a PR to submit your patch?

@fisherxu Let me check against release 1.14 first; it seems some problems are already fixed there. Then I'll create a PR for the remaining things.

#4589 I encountered the problem described in this ticket... @fisherxu @Shelley-BaoYue

Error when get version and healthz through metaserver.

@Windrow14 surely we want to add support for /healthz in line with #4904?

You can add the /healthz pass-through capability once #4904 has been merged.

@fujitatomoya PR #4904 has been merged. Based on this PR, you can support more URIs such as '/healthz'.

P.S. Would you be interested in creating some E2E tests for metaServer's non-resources URIs? : )

@ZhengXinwei-F thanks for the contribution.
CC: @Windrow14

Based on this PR, you can support more URIs such as '/healthz'.

we would add them on an as-needed basis.

#4589 I encountered the problem described in this ticket... @fisherxu @Shelley-BaoYue

#4843 (comment)

// Sometimes, we need guess kind according to resource:
// 1. In most cases, is like pods to Pod,
// 2. In some unusual cases, requires special treatment like endpoints to Endpoints
func UnsafeResourceToKind(r string) string {
	if len(r) == 0 {
		return r
	}
	unusualResourceToKind := map[string]string{
		"endpoints":                    "Endpoints",
		"endpointslices":               "EndpointSlice",
		"nodes":                        "Node",
		"namespaces":                   "Namespace",
		"services":                     "Service",
		"podstatus":                    "PodStatus",
		"nodestatus":                   "NodeStatus",
		"customresourcedefinitions":    "CustomResourceDefinition",
		"customresourcedefinitionlist": "CustomResourceDefinitionList",
	}
	if v, isUnusual := unusualResourceToKind[r]; isUnusual {
		return v
	}
	caser := cases.Title(language.Und)
	k := caser.String(r)
	switch {
	case strings.HasSuffix(k, "ies"):
		return strings.TrimSuffix(k, "ies") + "y"
	case strings.HasSuffix(k, "es"):
		return strings.TrimSuffix(k, "es")
	case strings.HasSuffix(k, "s"):
		return strings.TrimSuffix(k, "s")
	}
	return k
}

This function, used in func (f *Factory) Create(req *request.RequestInfo) http.Handler, may turn CRD resource names into unrecognized words; for example, ciliumnodes turns into Ciliumnod. Is there a way to append to unusualResourceToKind dynamically?
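A minimal sketch of one way the lookup table could be extended at runtime, assuming a registry type that does not exist in KubeEdge today. The names kindRegistry, Register, and guessKind are hypothetical, and guessKind only approximates the quoted function (it upper-cases the first letter with the standard library instead of golang.org/x/text/cases). It reproduces the ciliumnodes to Ciliumnod result reported above and shows an override taking precedence once registered.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// kindRegistry is a hypothetical, concurrency-safe override table.
// It is not part of KubeEdge; it only illustrates the idea.
type kindRegistry struct {
	mu        sync.RWMutex
	overrides map[string]string
}

func newKindRegistry() *kindRegistry {
	return &kindRegistry{overrides: map[string]string{"endpoints": "Endpoints"}}
}

// Register adds or replaces a resource-to-kind mapping at runtime.
func (r *kindRegistry) Register(resource, kind string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.overrides[resource] = kind
}

// guessKind approximates the suffix heuristic quoted above.
func guessKind(resource string) string {
	if resource == "" {
		return resource
	}
	k := strings.ToUpper(resource[:1]) + resource[1:]
	switch {
	case strings.HasSuffix(k, "ies"):
		return strings.TrimSuffix(k, "ies") + "y"
	case strings.HasSuffix(k, "es"):
		return strings.TrimSuffix(k, "es")
	case strings.HasSuffix(k, "s"):
		return strings.TrimSuffix(k, "s")
	}
	return k
}

// Kind consults the overrides first and falls back to the heuristic.
func (r *kindRegistry) Kind(resource string) string {
	r.mu.RLock()
	v, ok := r.overrides[resource]
	r.mu.RUnlock()
	if ok {
		return v
	}
	return guessKind(resource)
}

func main() {
	reg := newKindRegistry()
	fmt.Println(reg.Kind("ciliumnodes")) // heuristic strips the trailing "es"
	reg.Register("ciliumnodes", "CiliumNode")
	fmt.Println(reg.Kind("ciliumnodes")) // registered override wins
}
```

Where the overrides would come from (a config file, or the CRD's own spec.names.kind) is left open here; the sketch only shows that the table does not have to be a literal compiled into the function.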

edgecore-panic-when-receive-post.log cilium-agent-post.pcap.txt

An empty options.FieldValidation makes validationDirective equal to metav1.FieldValidationWarn, which leads to decodeSerializer = s.StrictSerializer, and that field is nil.
A nil decodeSerializer produces a nil decoder, which panics when called.
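The selection logic described here can be sketched with simplified stand-in types. This is an illustration only: decoder, serializerInfo, jsonDecoder, and pickDecoder are hypothetical names, not the apimachinery or KubeEdge types, and the nil guard is an assumption about how the failure could be surfaced as an error rather than a panic.

```go
package main

import (
	"errors"
	"fmt"
)

// decoder and serializerInfo are simplified stand-ins; only the
// fields relevant to the panic are modeled.
type decoder interface {
	Decode(data []byte) (string, error)
}

type serializerInfo struct {
	Serializer       decoder
	StrictSerializer decoder // nil when the codec factory never sets it
}

type jsonDecoder struct{ strict bool }

func (d jsonDecoder) Decode(data []byte) (string, error) {
	return fmt.Sprintf("decoded %d bytes (strict=%t)", len(data), d.strict), nil
}

// pickDecoder follows the selection described above: an empty directive
// is treated like "Warn", and "Warn" and "Strict" both select
// StrictSerializer.
func pickDecoder(info serializerInfo, fieldValidation string) (decoder, error) {
	if fieldValidation == "" {
		fieldValidation = "Warn"
	}
	d := info.Serializer
	if fieldValidation == "Warn" || fieldValidation == "Strict" {
		d = info.StrictSerializer
	}
	if d == nil {
		// Hypothetical guard: report the gap instead of panicking later.
		return nil, errors.New("no serializer configured for fieldValidation=" + fieldValidation)
	}
	return d, nil
}

func main() {
	broken := serializerInfo{Serializer: jsonDecoder{}}
	if _, err := pickDecoder(broken, ""); err != nil {
		fmt.Println("without StrictSerializer:", err)
	}

	fixed := serializerInfo{
		Serializer:       jsonDecoder{},
		StrictSerializer: jsonDecoder{strict: true},
	}
	d, _ := pickDecoder(fixed, "")
	out, _ := d.Decode([]byte(`{"kind":"CiliumNode"}`))
	fmt.Println("with StrictSerializer:", out)
}
```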

Fix to the panic issue above:

diff --git a/edge/pkg/metamanager/metaserver/kubernetes/serializer/serializer.go b/edge/pkg/metamanager/metaserver/kubernetes/serializer/serializer.go
index 62c9e83ac..e06570de3 100644
--- a/edge/pkg/metamanager/metaserver/kubernetes/serializer/serializer.go
+++ b/edge/pkg/metamanager/metaserver/kubernetes/serializer/serializer.go
@@ -31,6 +31,7 @@ func (f WithoutConversionCodecFactory) SupportedMediaTypes() []runtime.Serialize
                        EncodesAsText:    true,
                        Serializer:       json.NewSerializerWithOptions(json.DefaultMetaFactory, f.creator, f.typer, json.SerializerOptions{Pretty: false}),
                        PrettySerializer: json.NewSerializerWithOptions(json.DefaultMetaFactory, f.creator, f.typer, json.SerializerOptions{Pretty: true}),
+                       StrictSerializer: json.NewSerializerWithOptions(json.DefaultMetaFactory, f.creator, f.typer, json.SerializerOptions{Strict: true}),
                        StreamSerializer: &runtime.StreamSerializerInfo{
                                EncodesAsText: true,
                                Serializer:    json.NewSerializerWithOptions(json.DefaultMetaFactory, f.creator, f.typer, json.SerializerOptions{Pretty: false}),

Patch for UnsafeResourceToKind. Everything still seems to work without this patch, though if you print Kind in edgecore you will see Ciliumnod.

diff --git a/pkg/metaserver/util/util.go b/pkg/metaserver/util/util.go
index f549faf0d..266e1ed82 100644
--- a/pkg/metaserver/util/util.go
+++ b/pkg/metaserver/util/util.go
@@ -57,6 +57,7 @@ func UnsafeResourceToKind(r string) string {
                "nodestatus":                   "NodeStatus",
                "customresourcedefinitions":    "CustomResourceDefinition",
                "customresourcedefinitionlist": "CustomResourceDefinitionList",
+               "ciliumnodes":                  "CiliumNode",
        }
        if v, isUnusual := unusualResourceToKind[r]; isUnusual {
                return v
@@ -84,6 +85,7 @@ func UnsafeKindToResource(k string) string {
                "NodeStatus":                   "nodestatus",
                "CustomResourceDefinition":     "customresourcedefinitions",
                "CustomResourceDefinitionList": "customresourcedefinitionlist",
+               "CiliumNode":                   "ciliumnodes",
        }
        if v, isUnusual := unusualKindToResource[k]; isUnusual {
                return v

Thank you for your contribution. Could you please submit a PR to resolve this issue?

Writing code in KubeEdge that adapts to cilium or any other specific CNI does not appear to be very elegant. Perhaps we could solve this issue through configuration or some other dynamic mechanism.

cc @fisherxu @Shelley-BaoYue

One more thing: if we add support for healthz, readyz and livez as shown below, 127.0.0.1:10550/readyz is handled fine. But for 127.0.0.1:10550/readyz?verbose, edgecore still handles it like 127.0.0.1:10550/readyz. This differs from what the Kubernetes documentation describes: https://kubernetes.io/docs/reference/using-api/health-checks/#api-endpoints-for-health. What's your opinion on this?

diff --git a/pkg/util/pass-through/pass_through.go b/pkg/util/pass-through/pass_through.go
index 225512546..c2e2dbd1b 100644
--- a/pkg/util/pass-through/pass_through.go
+++ b/pkg/util/pass-through/pass_through.go
@@ -4,10 +4,16 @@ type passRequest string
 
 const (
        versionRequest passRequest = "/version::get"
+       healthRequest  passRequest = "/healthz::get"
+       liveRequest    passRequest = "/livez::get"
+       readyRequest   passRequest = "/readyz::get"
 )
 
 var passThroughMap = map[passRequest]bool{
        versionRequest: true,
+       healthRequest:  true,
+       liveRequest:    true,
+       readyRequest:   true,
 }
 
 // IsPassThroughPath determining whether the uri can be passed through
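To illustrate the ?verbose concern, here is a small sketch assuming the pass-through key is built from the URL path and verb alone, in the "<path>::<verb>" style of the constants in the diff. That assumption about edgecore, and the helper names passKey and hasVerbose, are hypothetical. Under it, /readyz and /readyz?verbose map to the same key, so the query would have to be carried separately for the verbose form to behave differently.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// passKey builds a "<path>::<verb>" key in the style of the constants in
// the diff. Using the path alone is an assumption about edgecore.
func passKey(rawURL, verb string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	return u.Path + "::" + strings.ToLower(verb), nil
}

// hasVerbose reports whether the request carries a "verbose" query
// parameter, which a path-only key would silently drop.
func hasVerbose(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	_, ok := u.Query()["verbose"]
	return ok, nil
}

func main() {
	for _, raw := range []string{
		"http://127.0.0.1:10550/readyz",
		"http://127.0.0.1:10550/readyz?verbose",
	} {
		key, _ := passKey(raw, "GET")
		verbose, _ := hasVerbose(raw)
		fmt.Printf("%s -> key=%q verbose=%t\n", raw, key, verbose)
	}
}
```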

@Windrow14

#4844 (comment)

i am trying to understand this patch. this patch is NOT required, right? w/o this patch, do we get any specific logs or warnings that we need to take care of? could you provide more specifics?

Writing code in KubeEdge that adapts to cilium or other cni does not appear to be very elegant. Perhaps we could solve this issue using configuration or other dynamic ways.

agree. mainline should not be dependent on specific CNI implementations.

But when it comes to 127.0.0.1:10550/readyz?verbose, edgecore is still handling it like 127.0.0.1:10550/readyz. This behavior is different from k8s document's description: https://kubernetes.io/docs/reference/using-api/health-checks/#api-endpoints-for-health.

right, i have the same concern. if anything requires the verbose behavior (not sure whether cilium does or will in the future), that would not be supported.
at least, we would want to create a dedicated issue to track this. what do you think? @Shelley-BaoYue @ZhengXinwei-F

either @Shelley-BaoYue or @ZhengXinwei-F could you reopen this issue?

My thoughts are,

  • #4844 (comment)
  • Concrete procedure and documentation on how to enable Cilium with KubeEdge
  • Discussion of what belongs in the generic part of KubeEdge and what belongs in a Cilium-specific procedure. (if a fix or configuration is generic for KubeEdge, it should be integrated into KubeEdge; otherwise we can provide a special procedure and documentation with configuration files to enable Cilium)

what do you think?

Ok, I have reopened this issue. :-)

this patch is NOT required, right?

I think so; I don't see any trouble without this patch. The name does not even seem to be printed by the existing loggers, so we would need to add extra logging to check that string. If it is only used internally, the functions that rely on it should work properly as long as it stays consistent.

@Shelley-BaoYue can you reopen this? the bot automatically closes this issue once a corresponding issue is closed.

@Windrow14 since this is a meta-ticket for tracking, i would suggest creating a dedicated issue for each specific PR.