SR-IOV & Bond CNI Fails to start and terminate pod
itsalexjones opened this issue · 0 comments
Hi Everyone,
I have deployed the SR-IOV CNI via the SR-IOV Network Device Plugin (v3.7.0), and the bond CNI manually (from master, as the latest release is very old), and I am trying to create a bond interface from two VFs in the pod.
I have used examples from the bond-cni and SR-IOV CNI documentation to do this, and have previously had single SR-IOV interfaces working correctly.
What happened:
When the pod is started, the following event is logged and the pod fails to start:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "<snip>": plugin type ="multus" name="multus-cni-network" failed (add): [default/test-pod:sriov-network]: error adding container to network "sriov-network": cannot convert: no valid IP addresses
When the pod is terminated, the following event is logged and the pod fails to be deleted:
error killing pod: failed to "KillPodSandbox" for "<snip>" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"<snip>\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): delegateDel: error invoking DelegateDel - \"sriov\": error in getting result from DelNetwork: invalid version \"\": the version is empty / delegateDel: error invoking DelegateDel - \"sriov\": error in getting result from DelNetwork: invalid version \"\": the version is empty"
What you expected to happen:
All documentation suggests the pod should start with the four interfaces as configured.
How to reproduce it (as minimally and precisely as possible):
Deploy the following three Network Attachment Definitions (assume the resources are already created):
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_PF_1
spec:
  config: '{
      "type": "sriov",
      "name": "sriov-network",
      "spoofchk":"off"
    }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net2
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_PF_2
spec:
  config: '{
      "type": "sriov",
      "name": "sriov-network",
      "spoofchk":"off"
    }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: bond-net1
spec:
  config: '{
      "type": "bond",
      "cniVersion": "0.3.1",
      "name": "bond-net1",
      "mode": "active-backup",
      "failOverMac": 1,
      "linksInContainer": true,
      "miimon": "100",
      "mtu": 1500,
      "links": [
        {"name": "net1"},
        {"name": "net2"}
      ],
      "ipam": {
        "type": "host-local",
        "subnet": "10.72.0.0/16",
        "rangeStart": "10.72.61.192",
        "rangeEnd": "10.72.61.255"
      }
    }'
and the following pod:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
        {"name": "sriov-net1",
         "interface": "net1"
        },
        {"name": "sriov-net2",
         "interface": "net2"
        },
        {"name": "bond-net1",
         "interface": "bond0"
        }]'
spec:
  restartPolicy: Never
  containers:
  - name: bond-test
    image: alpine:latest
    command:
    - /bin/sh
    - "-c"
    - "sleep 60m"
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        intel.com/intel_sriov_PF_1: '1'
        intel.com/intel_sriov_PF_2: '1'
      limits:
        intel.com/intel_sriov_PF_1: '1'
        intel.com/intel_sriov_PF_2: '1'
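The bond config names its slave links (net1, net2), and those names need to match the "interface" values requested in the pod's networks annotation. A minimal Python sketch of that cross-check, using values copied from the manifests above (bond config abbreviated). The function name `missing_bond_links` is hypothetical, and this is only a sanity check on the manifests, not something the bond CNI is known to run itself:

```python
import json

# Values copied from the pod annotation and bond-net1 definition above
# (bond config abbreviated to the fields this check needs).
NETWORKS_ANNOTATION = """[
  {"name": "sriov-net1", "interface": "net1"},
  {"name": "sriov-net2", "interface": "net2"},
  {"name": "bond-net1", "interface": "bond0"}
]"""

BOND_CONFIG = """{
  "type": "bond",
  "cniVersion": "0.3.1",
  "name": "bond-net1",
  "linksInContainer": true,
  "links": [{"name": "net1"}, {"name": "net2"}]
}"""


def missing_bond_links(annotation, bond_config):
    """Return bond link names that no attachment in the annotation requests."""
    attachments = json.loads(annotation)
    bond = json.loads(bond_config)
    # Interface names the pod asks Multus to create.
    requested = {att.get("interface") for att in attachments}
    return [link["name"] for link in bond["links"] if link["name"] not in requested]


if __name__ == "__main__":
    print(missing_bond_links(NETWORKS_ANNOTATION, BOND_CONFIG))
```

An empty list means every link the bond expects is requested by the annotation, which is the case for the manifests in this report.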
Anything else we need to know?:
If you assign an address to the two SR-IOV interfaces (a static address is fine), the pod is created correctly (but with two extra addresses on the bond slaves) - but the pod still fails to terminate.
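One observation that may be relevant to the "invalid version" error on delete: in the definitions above, the bond config sets "cniVersion" while the two sriov configs do not. Whether that is the actual cause is not confirmed here. The Python sketch below only lists which configs omit the field; the function name `configs_without_cni_version` is hypothetical, and the bond config is abbreviated:

```python
import json

# spec.config strings copied from the three NetworkAttachmentDefinitions above
# (bond config abbreviated to the fields relevant to this check).
NAD_CONFIGS = {
    "sriov-net1": '{ "type": "sriov", "name": "sriov-network", "spoofchk":"off" }',
    "sriov-net2": '{ "type": "sriov", "name": "sriov-network", "spoofchk":"off" }',
    "bond-net1": '{ "type": "bond", "cniVersion": "0.3.1", "name": "bond-net1" }',
}


def configs_without_cni_version(configs):
    """Return the names of definitions whose config has no (or an empty) cniVersion."""
    missing = []
    for nad_name, raw in configs.items():
        conf = json.loads(raw)
        if not conf.get("cniVersion"):
            missing.append(nad_name)
    return sorted(missing)


if __name__ == "__main__":
    print(configs_without_cni_version(NAD_CONFIGS))
```

Against a live cluster, the same strings could be taken from the spec.config field of each item returned by kubectl get net-attach-def -o json.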
Environment:
- Multus version
  image path and image ID (from 'docker images'): ghcr.io/k8snetworkplumbingwg/multus-cni:v3.8
- Kubernetes version (use kubectl version): v1.29.5
- Primary CNI for Kubernetes cluster: Calico
- OS (e.g. from /etc/os-release): Debian 12
- File of '/etc/cni/net.d/':
{
  "cniVersion": "0.4.0",
  "name": "multus-cni-network",
  "type": "multus",
  "capabilities": {
    "portMappings": true,
    "bandwidth": true
  },
  "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
  "delegates": [
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "datastore_type": "kubernetes",
          "nodename": "lqbkubedab-01",
          "type": "calico",
          "log_level": "info",
          "log_file_path": "/var/log/calico/cni/cni.log",
          "ipam": {
            "type": "calico-ipam",
            "assign_ipv4": "true"
          },
          "policy": {
            "type": "k8s"
          },
          "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        },
        {
          "type": "bandwidth",
          "capabilities": {
            "bandwidth": true
          }
        }
      ]
    }
  ]
}
- File of '/etc/cni/multus/net.d'
- NetworkAttachment info (use kubectl get net-attach-def -o yaml)
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"k8s.cni.cncf.io/v1","kind":"NetworkAttachmentDefinition","metadata":{"annotations":{},"name":"bond-net1","namespace":"default"},"spec":{"config":"{ \"type\": \"bond\", \"cniVersion\": \"0.3.1\", \"name\": \"bond-net1\", \"mode\": \"active-backup\", \"failOverMac\": 1, \"linksInContainer\": true, \"miimon\": \"100\", \"mtu\": 1500, \"links\": [ {\"name\": \"net1\"}, {\"name\": \"net2\"} ], \"ipam\": { \"type\": \"host-local\", \"subnet\": \"10.72.0.0/16\", \"rangeStart\": \"10.72.61.192\", \"rangeEnd\": \"10.72.61.255\" } }"}}
    creationTimestamp: "2024-06-24T13:30:31Z"
    generation: 2
    name: bond-net1
    namespace: default
    resourceVersion: "2206601"
    uid: 3eac8c19-8674-4c09-bdc8-b5b93246a972
  spec:
    config: '{ "type": "bond", "cniVersion": "0.3.1", "name": "bond-net1", "mode":
      "active-backup", "failOverMac": 1, "linksInContainer": true, "miimon": "100",
      "mtu": 1500, "links": [ {"name": "net1"}, {"name": "net2"} ], "ipam": { "type":
      "host-local", "subnet": "10.72.0.0/16", "rangeStart": "10.72.61.192", "rangeEnd":
      "10.72.61.255" } }'
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_PF_AXIA_1
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"k8s.cni.cncf.io/v1","kind":"NetworkAttachmentDefinition","metadata":{"annotations":{"k8s.v1.cni.cncf.io/resourceName":"intel.com/intel_sriov_PF_AXIA_1"},"name":"sriov-net1","namespace":"default"},"spec":{"config":"{ \"type\": \"sriov\", \"name\": \"sriov-network\", \"spoofchk\":\"off\" }"}}
    creationTimestamp: "2024-06-24T13:30:24Z"
    generation: 4
    name: sriov-net1
    namespace: default
    resourceVersion: "2211948"
    uid: 043986f3-5e8a-4861-b65d-31232c2b5c07
  spec:
    config: '{ "type": "sriov", "name": "sriov-network", "spoofchk":"off" }'
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_PF_AXIA_2
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"k8s.cni.cncf.io/v1","kind":"NetworkAttachmentDefinition","metadata":{"annotations":{"k8s.v1.cni.cncf.io/resourceName":"intel.com/intel_sriov_PF_AXIA_2"},"name":"sriov-net2","namespace":"default"},"spec":{"config":"{ \"type\": \"sriov\", \"name\": \"sriov-network\", \"spoofchk\":\"off\" }"}}
    creationTimestamp: "2024-06-24T13:30:27Z"
    generation: 4
    name: sriov-net2
    namespace: default
    resourceVersion: "2211955"
    uid: a25fb747-8ffb-4524-9339-0740e3514f69
  spec:
    config: '{ "type": "sriov", "name": "sriov-network", "spoofchk":"off" }'
kind: List
metadata:
  resourceVersion: ""
- Target pod yaml info (with annotation, use kubectl get pod <podname> -o yaml)
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 9840e0391c6916c0182bd20d6cc1bcc71b3bceaee8e091daebd6a48a077dbca3
    cni.projectcalico.org/podIP: ""
    cni.projectcalico.org/podIPs: ""
    k8s.v1.cni.cncf.io/networks: '[ {"name": "sriov-net1", "interface": "net1" },
      {"name": "sriov-net2", "interface": "net2" }, {"name": "bond-net1", "interface":
      "bond0" }]'
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{"k8s.v1.cni.cncf.io/networks":"[ {\"name\": \"sriov-net1\", \"interface\": \"net1\" }, {\"name\": \"sriov-net2\", \"interface\": \"net2\" }, {\"name\": \"bond-net1\", \"interface\": \"bond0\" }]"},"name":"test-pod","namespace":"default"},"spec":{"containers":[{"command":["/bin/sh","-c","sleep 60m"],"image":"alpine:latest","imagePullPolicy":"IfNotPresent","name":"bond-test","resources":{"limits":{"intel.com/intel_sriov_PF_AXIA_1":"1","intel.com/intel_sriov_PF_AXIA_2":"1"},"requests":{"intel.com/intel_sriov_PF_AXIA_1":"1","intel.com/intel_sriov_PF_AXIA_2":"1"}}}],"restartPolicy":"Never"}}
  creationTimestamp: "2024-06-24T15:35:25Z"
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2024-06-24T15:37:12Z"
  name: test-pod
  namespace: default
  resourceVersion: "2212053"
  uid: 23f48611-ed65-4de8-8617-8b0a91591c28
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 60m
    image: alpine:latest
    imagePullPolicy: IfNotPresent
    name: bond-test
    resources:
      limits:
        intel.com/intel_sriov_PF_AXIA_1: "1"
        intel.com/intel_sriov_PF_AXIA_2: "1"
      requests:
        intel.com/intel_sriov_PF_AXIA_1: "1"
        intel.com/intel_sriov_PF_AXIA_2: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-k2krr
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: lqbkubedab-01
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-k2krr
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-06-24T15:35:25Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-06-24T15:35:25Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-06-24T15:35:25Z"
    message: 'containers with unready status: [bond-test]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-06-24T15:35:25Z"
    message: 'containers with unready status: [bond-test]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-06-24T15:35:25Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: alpine:latest
    imageID: ""
    lastState: {}
    name: bond-test
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: ContainerCreating
  hostIP: 10.72.60.30
  hostIPs:
  - ip: 10.72.60.30
  phase: Pending
  qosClass: BestEffort
  startTime: "2024-06-24T15:35:25Z"
- Other log outputs (if you use multus logging)