Regression in Windows 11 Build 22621.x - pod networking cannot reach any destination
hach-que opened this issue · 20 comments
(This bug was originally posted to microsoft/SDN#563, but I've moved it here because this feels like a more appropriate avenue to report it)
I've been experimenting with getting Windows 11 to work as a Kubernetes node for dev/testing purposes, and I've run into a regression between Build 22000 and Build 22621: pods can no longer reach any destination (Internet, service addresses, other pod addresses).
Background
I have been working on a tool called RKM which does all the required setup for Kubernetes in development mode (without requiring VMs). It does all of the configuration, starts components, configures networking, etc.
Now I started out trying to get things working with Flannel VXLAN; I got Linux working fine but no luck on Windows - pods could not reach anywhere. After a little discussion in the sig-windows Slack, I was convinced to try setting up Calico instead. All of the testing results from here on out are from the calico branch of RKM (linked above).
"Does not work" means pods cannot reach the outside world or any other services or pods (the only thing that works is pinging 127.0.0.1).
"Works" means everything works, including DNS resolution of service names inside the Kubernetes cluster.
- Fresh Windows 11 22H2 22621.1265 installed in a VM: ❌ Does not work
- Checkpointed so I could restore each time to test.
- Installed from the ISO on this page https://www.microsoft.com/en-us/software-download, and allowed Windows Updates to run until it was considered "up-to-date".
- Fresh Windows 11 21H2 22000.194 installed in a VM: ✅ Works
- Again, checkpointed so I could restore each time to test.
- Installed using Rufus to get the 22000 ISO. Paused Windows updates to prevent any from installing.
- My existing dev machine (bare metal) Windows 11 21H2 22000.1574: ✅ Works
Reproduction Steps
The reproduction steps are fairly simple, but you will need a Linux VM to act as the Kubernetes master. All of these VMs should be on the same subnet (e.g. `10.7.0.0/16`).
Build RKM on the calico branch
- Clone RKM from the calico branch and build it. You can build this in Visual Studio; it's a .NET 7 console app, so we'll just be using the same binaries on every machine.
- If you don't trust me, you can build this in a VM as well if you want to be extra safe.
- After building, you should have a `net7.0` directory inside `src\rkm-daemon\Debug`.
Set up the Linux VM
- Create an Ubuntu 20.04 LTS VM, with some reasonable resources (2 cores, 4GB RAM).
- You'll probably want to get SSH working so you can SSH/SCP into this box.
- Install .NET 7.
- Install the `conntrack` package.
- Copy the RKM binaries to the Linux VM.
- As the root user, run `dotnet rkm-daemon.dll` and leave it running. It'll take a moment to start up all the components.
- In a second shell or SSH session, you should be able to inspect the cluster by running `/opt/rkm/$(hostname)*/kubectl get pods --all-namespaces`. Once that's working and you can see the core pods, you're ready to move on to setting up the Windows VMs.
Set up the Windows VMs
You're going to set up two VMs here to contrast the build numbers. How you get 22000 and 22621 onto the machines isn't very important (I'm sure internally Microsoft has easier access to ISOs than I do), as long as you get one machine on 22000.x and the other on 22621.x.
Then on each machine:
- Install the .NET 7 runtime.
- Copy the RKM binaries to each Windows VM.
- In an Administrative Command Prompt, run `dotnet rkm-daemon.dll`.
- The first time this runs, it will install the Windows Containers feature and automatically reboot.
- After the reboot, open the Admin Command Prompt again and run `dotnet rkm-daemon.dll` to continue the setup process.
Checking that everything is working
Back on the Linux VM, you should now be able to run `/opt/rkm/$(hostname)*/kubectl get nodes -o wide` and see output similar to this (your hostnames will vary):
```
NAME       STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
deadpool   Ready    <none>   82m   v1.26.1   10.7.0.157    <none>        Windows 10 Pro       10.0.22000.194      containerd://1.6.18+unknown
gambit     Ready    <none>   13h   v1.26.1   10.7.0.156    <none>        Windows 10 Pro       10.0.22621.1265     containerd://1.6.18+unknown
sentry     Ready    <none>   17h   v1.26.1   10.7.0.32    <none>        Ubuntu 20.04.5 LTS   5.4.0-139-generic   containerd://1.6.18
```
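To quickly see which node is on which build, you can pull the kernel column out of that listing with awk. A sketch, with the sample output above embedded as a string so it runs without a cluster (in practice you would pipe `kubectl get nodes -o wide` straight into the awk command; `$(NF-1)` works here only because every OS-IMAGE value in this sample happens to be three words):

```shell
# Extract NAME and KERNEL-VERSION from `kubectl get nodes -o wide` output.
# Sample data taken from the issue, embedded so this runs anywhere.
nodes_wide='NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
deadpool Ready <none> 82m v1.26.1 10.7.0.157 <none> Windows 10 Pro 10.0.22000.194 containerd://1.6.18+unknown
gambit Ready <none> 13h v1.26.1 10.7.0.156 <none> Windows 10 Pro 10.0.22621.1265 containerd://1.6.18+unknown
sentry Ready <none> 17h v1.26.1 10.7.0.32 <none> Ubuntu 20.04.5 LTS 5.4.0-139-generic containerd://1.6.18'

# KERNEL-VERSION is the second-to-last field in each data row here,
# because each OS-IMAGE value in the sample is exactly three words.
printf '%s\n' "$nodes_wide" | awk 'NR > 1 { print $1, $(NF-1) }'
# prints:
# deadpool 10.0.22000.194
# gambit 10.0.22621.1265
# sentry 5.4.0-139-generic
```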
Deploy the testing manifest
Copy this file to the Linux VM, and then apply it to the cluster with `/opt/rkm/$(hostname)*/kubectl apply -f ...`:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    matchLabels:
      name: nginx
  template:
    metadata:
      labels:
        name: nginx
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      containers:
        - name: nginx
          image: nginx
      terminationGracePeriodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  ports:
    - name: nginx
      port: 80
      protocol: TCP
      targetPort: 80
  selector:
    name: nginx
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: busybox
  namespace: default
spec:
  selector:
    matchLabels:
      name: busybox
  template:
    metadata:
      labels:
        name: busybox
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      containers:
        - name: busybox
          image: busybox
          stdin: true
          tty: true
      terminationGracePeriodSeconds: 30
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: win-test-nano
  namespace: default
spec:
  selector:
    matchLabels:
      name: win-test-nano
  template:
    metadata:
      labels:
        name: win-test-nano
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
        - name: win-test-nano
          image: mcr.microsoft.com/powershell:lts-7.2-nanoserver-ltsc2022
          stdin: true
          tty: true
      terminationGracePeriodSeconds: 30
```
You can then get the names of the pods for the further testing steps with `/opt/rkm/$(hostname)*/kubectl get pods -o wide`.
Test that Linux networking is working
This should pass easily, but as a sanity check, use `/opt/rkm/$(hostname)*/kubectl attach -it <name of busybox pod>` and then run `ping 1.1.1.1`. You should be able to get to the Internet.
You could also deploy an Ubuntu container, install `curl`, and test connectivity that way, but the Linux side of things is stable and doesn't really need confirmation beyond a simple ping.
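If you want a quick pass/fail summary instead of eyeballing ping output, the sanity check can be wrapped in a tiny POSIX sh harness. This is a sketch of my own, not part of RKM; inside the busybox pod you would call it with the real commands, e.g. `check "internet" ping -c 1 -W 5 1.1.1.1`, while the demonstration below uses stand-in commands so it runs anywhere:

```shell
# Minimal pass/fail harness (sketch). `check` runs a command, suppresses
# its output, and reports PASS or FAIL based on the exit status.
check() {
  label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $label"
  else
    echo "FAIL: $label"
  fi
}

# Stand-in demonstration (no network needed). Inside the busybox pod you
# would instead run e.g.: check "internet" ping -c 1 -W 5 1.1.1.1
check "command that succeeds" true    # prints: PASS: command that succeeds
check "command that fails" false      # prints: FAIL: command that fails
```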
Compare Windows networking
OK, time for the part where we actually see the problem. When you ran `/opt/rkm/$(hostname)*/kubectl get pods -o wide`, you will have seen which node each of the nano containers is running on. Pair this up with the kernel versions shown in `/opt/rkm/$(hostname)*/kubectl get nodes -o wide` to know which pod is running under which kernel version.
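That pairing can also be scripted. A sketch: the pod listing below is hypothetical sample data (pod names and IPs are invented), the node-to-kernel mapping reuses the sample node listing from earlier in this issue, and in a real cluster both strings would come from kubectl instead:

```shell
# Hypothetical sample of `kubectl get pods -o wide` (NODE is the last column).
pods_wide='NAME READY STATUS RESTARTS AGE IP NODE
win-test-nano-abcde 1/1 Running 0 5m 10.42.1.10 deadpool
win-test-nano-fghij 1/1 Running 0 5m 10.42.2.11 gambit'

# Node-to-kernel mapping, from the sample node listing in this issue.
nodes_kernels='deadpool 10.0.22000.194
gambit 10.0.22621.1265'

# For each pod, look up its node's kernel version and report the pairing.
printf '%s\n' "$pods_wide" | awk 'NR > 1 { print $NF, $1 }' | while read -r node pod; do
  kernel=$(printf '%s\n' "$nodes_kernels" | awk -v n="$node" '$1 == n { print $2 }')
  echo "$pod runs on $node (kernel $kernel)"
done
# prints:
# win-test-nano-abcde runs on deadpool (kernel 10.0.22000.194)
# win-test-nano-fghij runs on gambit (kernel 10.0.22621.1265)
```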
For the pod that is running on 22000.x, use `/opt/rkm/$(hostname)*/kubectl attach -it <pod name>` and then run the following commands:
```
curl -v https://1.1.1.1
curl -v https://google.com
curl -v http://nginx
```
All of these should work, and you should get responses.
For the pod that is running on 22621.x, repeat the same commands. You will see that you just get timeouts and no connectivity, and thus you have reproduced the issue.
Workaround
There's no known workaround at the moment, because it's impossible to downgrade Windows 11 (and I'm not sure you can even hold off updates that long if Windows decides it wants to update).
This issue has been open for 30 days with no updates.
@MikeZappa87, please provide an update or close this issue.
Looking into this. I created an internal ticket (#44339550) for tracking.
Any update on the internal ticket for this? Would be nice to know if there is an ETA for a fix, since locking installs to 22000 and not upgrading is not ideal for various reasons.
As of 30/07/2023, locking a Windows installation to 22000 via the registry no longer works, and you'll be force-upgraded past the `TargetReleaseVersionInfo`. Unfortunately, this forced upgrade also bricks Windows Update and leaves the computer in a perpetual update/reboot loop that even System Restore cannot recover from (and now I'm re-installing Windows, which by its nature means I will be on 22H2).
Is there any update on this issue @fady-azmy-msft?
@hach-que, would you be able to try reproducing this issue on Windows Server 2022? This scenario isn't supported on client SKUs.
Windows Containers aren't supported on current Windows 11 clients? That's news to me.
Last I heard, the rationale for not publishing 22000 and later client versions of https://hub.docker.com/_/microsoft-windows was that it was no longer necessary, because Microsoft was saying that ltsc2022 images would run on Windows 11 (#117 (comment)). If the ltsc2022 image is no longer compatible with Windows 11, does that mean we'll get a 22621 client image published to Docker for use with Windows 11?
@hach-que, you are correct: Windows Containers are supported on Windows 11 clients. I meant to say that overlay and l2bridge networking are both server-specific scenarios. I'm assuming you are using one of these networking options with Flannel VXLAN.
How would we test Kubernetes in a development scenario without Calico? Are we expected to license Datacenter just to do development on Kubernetes Windows support?
Calico currently works on 22000, both server and client. There's really no practical reason for Calico not to work on 22621 - it's probably a pretty good indicator of a regression in the kernel or networking components that would also break Calico on Server 2025, which would mean it needs to get fixed anyway.
Hi @hach-que, I tried cloning your repo to reproduce the issue. However, I get permission errors with both the https link and the git link. I tried both SSH (by adding a public key) and an access token. I am probably doing something wrong. Is there a wiki I can refer to in order to clone your repo?
Got the repo cloned. Built rkm in visual studio. Tried starting the dotnet application in Ubuntu VM. Getting the below error.
```
Oct 19 17:40:40 systemd[1]: rkm.service: Scheduled restart job, restart counter is at 1.
Oct 19 17:40:40 systemd[1]: Stopped RKM (Redpoint Kubernetes Manager) runs Kubernetes on your local machine..
Oct 19 17:40:40 systemd[1]: Started RKM (Redpoint Kubernetes Manager) runs Kubernetes on your local machine..
Oct 19 17:40:40 systemd[37967]: rkm.service: Failed to execute command: No such file or directory
Oct 19 17:40:40 systemd[37967]: rkm.service: Failed at step EXEC spawning /home/jayantha/rkm_stuff/net7.0/rkm: No such file or directory
```
@hach-que, if you would still like us to debug this, could you please share the output of `(Get-HnsNetwork | ConvertTo-Json)` from one of the Windows VMs to start with?
> Got the repo cloned. Built rkm in visual studio. Tried starting the dotnet application in Ubuntu VM. Getting the below error.
I can help with this from the Linux side if needed
@hach-que, is this still an issue? I checked on the rkm project and noticed nothing has been committed in over 7 months. If this is still an issue, do you have specific requirements on VXLAN, or could something else be used?
Closing issue because it's going stale.
Hi @fady-azmy-msft @MikeZappa87, apologies for the non-response here.
RKM is pretty much on hold at the moment since containerd 1.7.0 / hcsshim broke the ability to mount virtual filesystem volumes in host-process containers (microsoft/hcsshim#1699). That's prevented me from upgrading our RKM-based cluster past containerd 1.6.x and blocked any further adoption of regular containers, since even host-process containers don't work on the newer versions.
Even if this issue were fixed in the kernel, practical usage of regular containers is also blocked on winfsp/winfsp#498, and there hasn't been progress on that issue in over 6 months either.
So unfortunately I haven't had the time to set RKM up on a bunch of new boxes to see if I can replicate the issues discussed here. It's likely I won't get a chance to revisit RKM until mid-2024 at the earliest.