microsoft/Windows-Containers

Regression in Windows 11 Build 22621.x - pod networking cannot reach any destination

hach-que opened this issue · 20 comments

(This bug was originally posted to microsoft/SDN#563, but I've moved it here because this feels like a more appropriate avenue to report it)

I've been experimenting with getting Windows 11 to work as a Kubernetes node for dev/testing purposes, and I've run into a regression between Build 22000 and Build 22621: pods can no longer reach any destination (Internet, service addresses, other pod addresses).

Background

I have been working on a tool called RKM which does all the required setup for Kubernetes in development mode (without requiring VMs). It handles all of the configuration, starts the components, configures networking, and so on.

Now I started out trying to get things working with Flannel VXLAN; I got Linux working fine but no luck on Windows - pods could not reach anywhere. After a little discussion in the sig-windows Slack, I was convinced to try setting up Calico instead. All of the testing results from here on out are from the calico branch of RKM (linked above).

For "Does not work", it means "pods can not reach the outside world or any other services or pods (the only thing that works is pinging 127.0.0.1)".
For "Works", it means "everything works, including DNS resolution of service names inside the Kubernetes cluster".

  • Fresh Windows 11 22H2 22621.1265 installed in a VM: ❌ Does not work
  • Fresh Windows 11 21H2 22000.194 installed in a VM: ✅ Works
    • Again, checkpointed so I could restore each time to test.
    • Installed using Rufus to get the 22000 ISO. Paused Windows updates to prevent any from installing.
  • My existing dev machine (bare metal) Windows 11 21H2 22000.1574: ✅ Works

Reproduction Steps

The reproduction steps are fairly simple, but you will need a Linux VM to act as the Kubernetes master. All of these VMs should be on the same subnet (e.g. 10.7.0.0/16).

Build RKM on the calico branch

  • Clone RKM from the calico branch and build it. You can build this in Visual Studio; it's a .NET 7 console app so we'll just be using the same binaries on every machine.
  • If you don't trust me, you can build this in a VM as well if you want to be extra safe.
  • After building you should have a net7.0 directory inside src\rkm-daemon\Debug.
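If you prefer the command line to Visual Studio, the build boils down to something like the following. This is only a sketch: the repository URL is a placeholder for the calico branch link above, and the exact project/output layout is an assumption on my part.

rem clone the calico branch and build (repository URL is a placeholder)
git clone -b calico <RKM repository URL>
cd rkm
dotnet build
rem the binaries you copy to the other machines end up under src\rkm-daemon\Debug\net7.0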

Set up the Linux VM

  • Create an Ubuntu 20.04 LTS VM, with some reasonable resources (2 cores, 4GB RAM).
  • You'll probably want to get SSH working so you can SSH/SCP into this box.
  • Install .NET 7.
  • Install the conntrack package.
  • Copy the RKM binaries to the Linux VM.
  • As the root user, run dotnet rkm-daemon.dll and leave it running. It'll take a moment to start up all the components.
  • In a second shell or SSH session, you should be able to inspect the cluster by running /opt/rkm/$(hostname)*/kubectl get pods --all-namespaces. Once that's working and you can see the core pods, you're ready to move onto setting up the Windows VMs.
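Condensed into commands, the Linux-side setup looks roughly like this. The copy destination, user/host names, and the .NET install method are assumptions; use whatever works for installing .NET 7 on Ubuntu 20.04 in your environment.

# prerequisites (assumes Microsoft's package feed for .NET 7 is configured on Ubuntu 20.04)
sudo apt-get update
sudo apt-get install -y dotnet-runtime-7.0 conntrack

# copy the net7.0 build output to the VM (user, host, and paths are placeholders)
scp -r net7.0 user@linux-vm:~/rkm

# run the daemon as root and leave it running
cd ~/rkm && sudo dotnet rkm-daemon.dll

# in a second shell or SSH session, check that the core pods come up
/opt/rkm/$(hostname)*/kubectl get pods --all-namespaces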

Set up the Windows VMs

You're going to set up two VMs here to contrast the build numbers. How you get 22000 and 22621 onto the machines isn't very important (I'm sure internally Microsoft has easier access to ISOs than I do), as long as you get one machine on 22000.x and the other on 22621.x.

Then on each machine:

  • Install the .NET 7 runtime.
  • Copy the RKM binaries to each Windows VM.
  • In an Administrative Command Prompt, run dotnet rkm-daemon.dll.
  • The first time this runs it will install the Windows Container feature and automatically reboot.
  • After the reboot, open an Administrative Command Prompt again and run dotnet rkm-daemon.dll to continue the setup process.
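On each Windows VM, the whole sequence in an Administrative Command Prompt is roughly the following (C:\rkm is just a placeholder for wherever you copied the net7.0 binaries to):

cd C:\rkm
rem first run installs the Windows Container feature and reboots the machine automatically
dotnet rkm-daemon.dll
rem after the reboot, open a new Administrative Command Prompt and run it again to finish setup
dotnet rkm-daemon.dll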

Checking that everything is working

Back on the Linux VM, you should now be able to run /opt/rkm/$(hostname)*/kubectl get nodes -o wide and see output similar to this (your hostnames will vary):

NAME         STATUS     ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
deadpool     Ready      <none>   82m   v1.26.1   10.7.0.157    <none>        Windows 10 Pro       10.0.22000.194      containerd://1.6.18+unknown
gambit       Ready      <none>   13h   v1.26.1   10.7.0.156    <none>        Windows 10 Pro       10.0.22621.1265     containerd://1.6.18+unknown
sentry       Ready      <none>   17h   v1.26.1   10.7.0.32     <none>        Ubuntu 20.04.5 LTS   5.4.0-139-generic   containerd://1.6.18

Deploy the testing manifest

Copy this file to the Linux VM, and then apply it to the cluster with /opt/rkm/$(hostname)*/kubectl apply -f ...:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    matchLabels:
      name: nginx
  template:
    metadata:
      labels:
        name: nginx
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      containers:
      - name: nginx
        image: nginx
      terminationGracePeriodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  ports:
    - name: nginx
      port: 80
      protocol: TCP
      targetPort: 80
  selector:
    name: nginx
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: busybox
  namespace: default
spec:
  selector:
    matchLabels:
      name: busybox
  template:
    metadata:
      labels:
        name: busybox
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      containers:
      - name: busybox
        image: busybox
        stdin: true
        tty: true
      terminationGracePeriodSeconds: 30
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: win-test-nano
  namespace: default
spec:
  selector:
    matchLabels:
      name: win-test-nano
  template:
    metadata:
      labels:
        name: win-test-nano
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
      - name: win-test-nano
        image: mcr.microsoft.com/powershell:lts-7.2-nanoserver-ltsc2022
        stdin: true
        tty: true
      terminationGracePeriodSeconds: 30

You can then get the names of the pods for further testing steps with: /opt/rkm/$(hostname)*/kubectl get pods -o wide.

Test that Linux networking is working

This should pass easily, but as a sanity check, use /opt/rkm/$(hostname)*/kubectl attach -it <name of busybox pod> and then run ping 1.1.1.1. You should be able to get to the Internet.

You could also deploy an Ubuntu container and install curl and test connectivity that way, but the Linux side of things is stable and doesn't really need further confirmation that it works beyond running a simple ping.
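Put together, the whole Linux-side sanity check is just the following (substitute the actual busybox pod name from the earlier get pods output):

/opt/rkm/$(hostname)*/kubectl attach -it <name of busybox pod>
# then, inside the pod:
ping 1.1.1.1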

Compare Windows networking

Ok, time for the part where we actually see the problem. When you ran /opt/rkm/$(hostname)*/kubectl get pods -o wide you will have seen which node each of the nano containers is running on. Pair this up with the kernel versions shown in /opt/rkm/$(hostname)*/kubectl get nodes -o wide to know which pod is running under which kernel version.

For the pod that is running on 22000.x, use /opt/rkm/$(hostname)*/kubectl attach -it <pod name> and then run the following commands:

curl -v https://1.1.1.1
curl -v https://google.com
curl -v http://nginx

This should all be working, and you should get responses.

For the pod that is running on 22621.x, repeat the same commands. You will see nothing but timeouts and no connectivity, and thus you have reproduced the issue.

Workaround

There's no known workaround at the moment, because it's impossible to downgrade Windows 11 (and I'm not sure you can even hold off updates that long if Windows decides it wants to update).

This issue has been open for 30 days with no updates.
@MikeZappa87, please provide an update or close this issue.

Looking into this. I created an internal ticket (#44339550) for tracking.

Any update on the internal ticket for this? Would be nice to know if there is an ETA for a fix, since locking installs to 22000 and not upgrading is not ideal for various reasons.

As of 30/07/2023, locking a Windows installation to 22000 via the registry no longer works; you'll be force-upgraded past the TargetReleaseVersionInfo setting. Unfortunately, it appears this forced upgrade also bricks Windows Update and leaves the computer in a perpetual update/reboot loop that even System Restore cannot recover from (so now I'm reinstalling Windows, which by its nature means I will be on 22H2).
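For reference, a registry hold of this kind is roughly the standard Windows Update target-release policy; the exact values below are illustrative rather than a record of what was configured here, and as noted above they no longer hold the machine back:

reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate" /v ProductVersion /t REG_SZ /d "Windows 11" /f
reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate" /v TargetReleaseVersion /t REG_DWORD /d 1 /f
reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate" /v TargetReleaseVersionInfo /t REG_SZ /d "21H2" /f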

Is there any update on this issue @fady-azmy-msft?

@hach-que, would you be able to try reproducing this issue on Windows Server 2022? This scenario isn't supported on client SKUs.

Windows Containers aren't supported on current Windows 11 clients? That's news to me.

Last I heard, the rationale for not publishing 22000 and later client versions of https://hub.docker.com/_/microsoft-windows was that it was no longer necessary, because Microsoft was saying that ltsc2022 images would run on Windows 11 (#117 (comment)). If the ltsc2022 image is no longer compatible with Windows 11, does that mean we'll get a 22621 client image published to Docker for use with Windows 11?

@hach-que, you are correct, Windows Containers are supported on Windows 11 clients. I meant to say that overlay and l2bridge networking are both server-specific scenarios. I'm assuming you are using one of these networking options with Flannel VXLAN.

How would we test Kubernetes in a development scenario without Calico? Are we expected to license Datacenter just to do development on Kubernetes Windows support?

Calico currently works on 22000, both server and client. There's really no practical reason for Calico not to work on 22621 - if it doesn't, that's a pretty good indicator there's a regression in the kernel or networking components that would also break Calico on Server 2025, which would thus mean it needs to get fixed anyway.

Hi @hach-que, I tried cloning your repo to try to reproduce the issue. However, I get permission errors with both the https link and the git link. I tried both ssh (by adding a public key) and an access token, so I am probably doing something wrong. Is there a wiki I can refer to in order to clone your repo?

Got the repo cloned. Built rkm in visual studio. Tried starting the dotnet application in Ubuntu VM. Getting the below error.

Oct 19 17:40:40 systemd[1]: rkm.service: Scheduled restart job, restart counter is at 1.
Oct 19 17:40:40 systemd[1]: Stopped RKM (Redpoint Kubernetes Manager) runs Kubernetes on your local machine..
Oct 19 17:40:40 systemd[1]: Started RKM (Redpoint Kubernetes Manager) runs Kubernetes on your local machine..
Oct 19 17:40:40 systemd[37967]: rkm.service: Failed to execute command: No such file or directory
Oct 19 17:40:40 systemd[37967]: rkm.service: Failed at step EXEC spawning /home/jayantha/rkm_stuff/net7.0/rkm: No such file or directory

@hach-que, if you would still like us to keep debugging this, could you start by sharing the output of (Get-HnsNetwork | ConvertTo-Json) from one of the Windows VMs?

Got the repo cloned. Built rkm in visual studio. Tried starting the dotnet application in Ubuntu VM. Getting the below error.

Oct 19 17:40:40 systemd[1]: rkm.service: Scheduled restart job, restart counter is at 1.
Oct 19 17:40:40 systemd[1]: Stopped RKM (Redpoint Kubernetes Manager) runs Kubernetes on your local machine..
Oct 19 17:40:40 systemd[1]: Started RKM (Redpoint Kubernetes Manager) runs Kubernetes on your local machine..
Oct 19 17:40:40 systemd[37967]: rkm.service: Failed to execute command: No such file or directory
Oct 19 17:40:40 systemd[37967]: rkm.service: Failed at step EXEC spawning /home/jayantha/rkm_stuff/net7.0/rkm: No such file or directory

I can help with this from the Linux side if needed

@hach-que is this still an issue? I checked on the rkm project and noticed nothing has been committed in over 7 months. If this is still an issue, do you have specific requirements for VXLAN, or could something else be used?

Closing issue because it's going stale.

Hi @fady-azmy-msft @MikeZappa87, apologies for the non-response here.

RKM is pretty much on hold at the moment since containerd 1.7.0 / hcsshim broke the ability to mount virtual filesystem volumes in host-process containers (microsoft/hcsshim#1699). That's prevented me from upgrading our RKM-based cluster past containerd 1.6.x and blocked any further adoption of regular containers, since even host-process containers don't work on the newer versions.

Even if this issue were fixed in the kernel, practical usage of regular containers is also blocked on winfsp/winfsp#498, and there hasn't been progress on that issue in over 6 months either.

So unfortunately I haven't had the time to set RKM up on a bunch of new boxes to see if I can reproduce the issues you were hitting here. It's likely I won't get a chance to revisit RKM until mid-2024 at the earliest.