aws/amazon-vpc-cni-k8s

NetworkPolicy blocks liveness and readiness probes

paleg opened this issue · 10 comments

What happened:
We are migrating from Calico to the VPC CNI network policy engine. One of the applications we use is Fluxcd, which by default creates several network policies while bootstrapping itself. These network policies block liveness and readiness probes, preventing the Fluxcd bootstrap from completing. The very same network policies work fine when enforced by Calico.

I created a new EKS cluster v1.27 (platform version eks.5) with VPC CNI v1.15.0-eksbuild.2, enabled network policy (enableNetworkPolicy=true), and tried to bootstrap Fluxcd v2.1.0 into it.

The bootstrap fails when the network policies for Flux are enabled, as the Flux controllers fail their liveness and readiness probes (Client.Timeout exceeded while awaiting headers). Flux bootstraps successfully as soon as I either disable network policy creation at the bootstrap level or remove the network policies manually with kubectl.
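For example, one way to do the manual removal (policy names taken from the definitions below):

$ kubectl delete networkpolicy allow-egress allow-scraping allow-webhooks -n flux-system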

The network policies Flux uses are described here: https://fluxcd.io/flux/flux-e2e/#fluxs-default-configuration-for-networkpolicy

Here are the definitions of the policies in use:

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  labels:
    app.kubernetes.io/instance: flux-system
    app.kubernetes.io/part-of: flux
    app.kubernetes.io/version: v2.1.0
  name: allow-egress
  namespace: flux-system
spec:
  egress:
  - {}
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  labels:
    app.kubernetes.io/instance: flux-system
    app.kubernetes.io/part-of: flux
    app.kubernetes.io/version: v2.1.0
  name: allow-scraping
  namespace: flux-system
spec:
  ingress:
  - from:
    - namespaceSelector: {}
    ports:
    - port: 8080
      protocol: TCP
  podSelector: {}
  policyTypes:
  - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  labels:
    app.kubernetes.io/instance: flux-system
    app.kubernetes.io/part-of: flux
    app.kubernetes.io/version: v2.1.0
  name: allow-webhooks
  namespace: flux-system
spec:
  ingress:
  - from:
    - namespaceSelector: {}
  podSelector:
    matchLabels:
      app: notification-controller
  policyTypes:
  - Ingress

$ kubectl get pods
NAME                                       READY   STATUS             RESTARTS         AGE
helm-controller-57f465457d-jf75t           0/1     CrashLoopBackOff   13 (3m1s ago)    36m
kustomize-controller-5566bd9b7-j88v5       0/1     CrashLoopBackOff   11 (4m57s ago)   66m
notification-controller-7f5b879f85-bgzbq   1/1     Running            1 (31m ago)      75m
source-controller-54944649ff-gmkzx         0/1     Running            1 (24s ago)      55s

Interestingly enough, the notification-controller passes its probes, maybe because it is explicitly selected by the allow-webhooks policy, although its health check port differs from the port specified in the allow-scraping netpol.
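To confirm which policy selects the pod, its labels can be compared against each policy's podSelector, e.g. (pod name taken from the listing above):

$ kubectl get pod notification-controller-7f5b879f85-bgzbq -n flux-system --show-labels
$ kubectl describe networkpolicy allow-webhooks -n flux-system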

The probes are defined as follows:

$ kubectl get deploy source-controller -o yaml
...
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: healthz
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
ports:
- containerPort: 9090
  name: http
  protocol: TCP
- containerPort: 8080
  name: http-prom
  protocol: TCP
- containerPort: 9440
  name: healthz
  protocol: TCP
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /
    port: http
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1

$ kubectl describe pods source-controller-54944649ff-gmkzx
...
  Normal   Scheduled  90s                 default-scheduler  Successfully assigned flux-system/source-controller-54944649ff-gmkzx to ip-10-0-25-131.eu-central-1.compute.internal
  Normal   Pulled     59s (x2 over 89s)   kubelet            Container image "ghcr.io/fluxcd/source-controller:v1.1.0" already present on machine
  Normal   Created    59s (x2 over 89s)   kubelet            Created container manager
  Normal   Started    58s (x2 over 89s)   kubelet            Started container manager
  Warning  Unhealthy  29s (x11 over 87s)  kubelet            Readiness probe failed: Get "http://10.0.26.118:9090/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  29s (x6 over 79s)   kubelet            Liveness probe failed: Get "http://10.0.26.118:9440/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Killing    29s (x2 over 59s)   kubelet            Container manager failed liveness probe, will be restarted
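The timeout can also be confirmed manually, independently of the kubelet, by curling the probe endpoint from the node (assuming shell access to the node; pod IP and port taken from the events above):

$ curl --max-time 1 http://10.0.26.118:9440/healthz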

What you expected to happen:
I expect the probes to pass regardless of the network policies in use, since they are performed by the kubelet on the same node, and according to the documentation "traffic to and from the node where a Pod is running is always allowed, regardless of the IP address of the Pod or the node".
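Until a fix ships, one conceivable workaround (my own sketch, not an official recommendation; the policy name allow-probes is mine, and the ports are taken from the source-controller deployment above, so the other controllers may need different ones) would be an extra policy that opens the probe ports to any source:

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-probes
  namespace: flux-system
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - ports:
    - port: 9440 # healthz port used by the liveness probe
      protocol: TCP
    - port: 9090 # http port used by the readiness probe
      protocol: TCP

An ingress rule that lists ports but no from clause admits traffic from any source on those ports, which should include the kubelet's probe traffic.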

How to reproduce it (as minimally and precisely as possible):
On an EKS cluster that uses VPC CNI with network policy enabled, try to bootstrap Flux with the --network-policy option enabled.
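For example (flags from the Flux CLI; owner and repository are placeholders):

$ flux bootstrap github \
    --owner=<org> \
    --repository=<repo> \
    --network-policy=true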

Environment:

  • Kubernetes version: v1.27.4-eks-2d98532
  • CNI Version: v1.15.0-eksbuild.2
  • OS: Amazon Linux 2
  • Kernel: 5.10.186-179.751.amzn2.x86_64

@paleg Known issue, tracked here: aws/aws-network-policy-agent#56. The issue is fixed, and the fix should be available in the next release.
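Once the release is out, the agent version running on a node can be checked from the images on the aws-node DaemonSet (assuming the default EKS add-on layout, where the network policy agent runs as a container in aws-node):

$ kubectl get daemonset aws-node -n kube-system -o jsonpath='{.spec.template.spec.containers[*].image}'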

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Can you please release this soon?

@runningman84 The release process has started, and the target for general availability is mid-October.

Any update on this?

Hello, I've encountered a similar issue even after upgrading the VPC CNI to v1.15.1 with network policy agent v1.0.4. Unfortunately, my startup probe is still failing, and it appears to be blocked by a network policy. Is there something I might be missing or overlooking? Thanks in advance.

@atilsensalduz please file a new issue at https://github.com/aws/aws-network-policy-agent/issues with some more information and we can help

Thanks @jdn5126, I raised the issue below. I had been waiting on this issue, so when I tried the new version and saw that it still didn't work, I thought I'd ask here directly.
aws/aws-network-policy-agent#108