aws/amazon-vpc-cni-k8s

disabling netpol feature in v1.14.1 causes aws-node pods to crash

tl-alex-nicot opened this issue · 7 comments

What happened:

Attach logs

{"level":"info","ts":"2023-09-13T12:51:42Z","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x55f8874b0347]

goroutine 72 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:115 +0x1fa
panic({0x55f8880e07c0, 0x55f8894ccd70})
	/root/sdk/go1.20.4/src/runtime/panic.go:884 +0x213
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).configureeBPFProbes(0xc000000180, {0x55f888444df0, 0xc000642a50}, {0xc0000475c0, 0x2a}, {0xc0003c4f60?, 0x1, 0xc00045c480?}, {0xc0000e9580, 0x2, ...}, ...)
	/workspace/controllers/policyendpoints_controller.go:257 +0x3e7
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).reconcilePolicyEndpoint(0xc000000180, {0x55f888444df0, 0xc000642a50}, 0xc00037f040)
	/workspace/controllers/policyendpoints_controller.go:231 +0x7b1
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).reconcile(0xc000000180, {0x55f888444df0, 0xc000642a50}, {{{0xc0004b2bf6, 0xa}, {0xc00045c480, 0x1a}}})
	/workspace/controllers/policyendpoints_controller.go:148 +0x24c
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).Reconcile(0xc000000180, {0x55f888444df0, 0xc000642a50}, {{{0xc0004b2bf6, 0xa}, {0xc00045c480, 0x1a}}})
	/workspace/controllers/policyendpoints_controller.go:129 +0x11f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x55f888444df0?, {0x55f888444df0?, 0xc000642a50?}, {{{0xc0004b2bf6?, 0x55f887f78020?}, {0xc00045c480?, 0x55f888431ca8?}}})
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0002bb180, {0x55f888444d48, 0xc00049e690}, {0x55f8881cc0a0?, 0xc0005767a0?})
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314 +0x377
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0002bb180, {0x55f888444d48, 0xc00049e690})
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:222 +0x587

What you expected to happen:
aws-node pods run with the netpol feature disabled

How to reproduce it (as minimally and precisely as possible):
Deploy CNI version 1.14.1 with the netpol feature flag enabled, then disable the feature flag; the pods that roll out after the change will crash.

Anything else we need to know?:

It's the aws-eks-nodeagent container, and it appears that some of the aws-node pods will run while others get the error posted above. I can delete the errored pods and even the underlying nodes, but they just come back with the same error. We are also getting random pods failing their readiness and liveness checks when they worked fine before.

Environment:

  • Kubernetes version (use kubectl version): v1.25
  • CNI Version 1.14.1
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):

Going by the above logs, it looks like the Network Policy feature is still enabled in the amazon-vpc-cni ConfigMap, and there are still active network policies configured. Please check.
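
For example, a quick way to check both (assuming kubectl access to the cluster):

# Inspect the amazon-vpc-cni ConfigMap for the enable-network-policy-controller key
kubectl -n kube-system get configmap amazon-vpc-cni -o yaml

# List any NetworkPolicy objects still configured across namespaces
kubectl get networkpolicy -A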

We downgraded back to 1.13.4, but our ConfigMap still says:

apiVersion: v1
data:
  enable-network-policy-controller: "true"
  enable-windows-ipam: "false"
kind: ConfigMap
metadata:
  creationTimestamp: "2023-09-01T14:48:56Z"
  labels:
    app.kubernetes.io/instance: aws-vpc-cni
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aws-node
    app.kubernetes.io/version: v1.14.1
    helm.sh/chart: aws-vpc-cni-1.14.1
    k8s-app: aws-node
  name: amazon-vpc-cni
  namespace: kube-system
  resourceVersion: "57929693"
  uid: 3ad7ae95-900a-427e-9c32-0577df9b20e7

We use the official EKS add-on.

@tl-alex-nicot how did you "disable" network policy? Are you referring to the --enable-network-policy command line argument for the aws-eks-nodeagent container? The required steps for disabling network policy, which we are working on more documentation for, are:

  1. Set enable-network-policy-controller to false in amazon-vpc-cni ConfigMap
  2. Set --enable-network-policy to false in the DaemonSet container config (see the sketch after this list)

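A rough sketch of those two steps with kubectl, assuming the default aws-node DaemonSet in kube-system (when using the managed add-on, the advanced config is the supported way to drive these values):

# Step 1: flip the key in the amazon-vpc-cni ConfigMap
kubectl -n kube-system patch configmap amazon-vpc-cni \
  --type merge -p '{"data":{"enable-network-policy-controller":"false"}}'

# Step 2: set the flag on the aws-eks-nodeagent container in the aws-node DaemonSet
kubectl -n kube-system edit daemonset aws-node
# ...then change the container argument to --enable-network-policy=false
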
As for the comment about downgrading using the EKS Managed Addon API, this is a function of how Helm works. The amazon-vpc-cni ConfigMap was introduced in v1.14.0+, so when you downgrade, that resource cannot be deleted by installing the old version; it has to be deleted manually. The same applies to new CRDs on downgrade.

Thanks @jdn5126, I did not set it to false in the ConfigMap; I just set it to false in the advanced config section of the EKS add-on.

Disabling NP via the Managed Add-ons advanced config should ideally have set it to false in the ConfigMap as well.
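
For reference, a rough sketch of applying that change through the Managed Add-on API (the enableNetworkPolicy key here is an assumption about the add-on configuration schema, so verify it against your add-on version; my-cluster is a placeholder):

# Update the add-on's advanced configuration to turn network policy off
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy":"false"}'

# The valid configuration keys for a given add-on version can be inspected with:
aws eks describe-addon-configuration --addon-name vpc-cni --addon-version <addon-version>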

Also, before you disable the Network Policy feature, you should delete the configured network policies; that gives the controller and the agent in the CNI pod a chance to clean up the custom resources they created for policy enforcement. Otherwise stale resources will linger around and can potentially cause issues. The public documentation should have the recommended flow for disabling NP. You can check the custom resources using 'kubectl get policyendpoints -A' and delete them if the feature is already disabled.
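
For example (names and namespaces below are placeholders; the delete is only needed if stale entries remain after the feature is disabled):

# List any PolicyEndpoints custom resources left behind
kubectl get policyendpoints -A

# If entries remain with the feature already disabled, delete them,
# substituting the namespace/name pairs returned above
kubectl -n <namespace> delete policyendpoint <name>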

Thanks for the help. I was able to successfully disable netpols by setting it to false in the ConfigMap and removing the PolicyEndpoints first.

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue, feel free to do so.