microsoft/Windows-Containers

mcr.microsoft.com/windows/servercore:10.0.20348.2340 BSODs Windows 2022 10.0.20348.1726, mcr.microsoft.com/windows/servercore:10.0.20348.1970 does not

doctorpangloss opened this issue · 17 comments

Describe the bug
The latest mcr.microsoft.com/windows/servercore:10.0.20348.2340 BSODs (crashes) a Windows 2022 10.0.20348.1726 host.

This bug is about the latest images running on a non-latest host.

PS C:\Users\Administrator> Get-WinEvent -FilterHashtable @{LogName='System'; Id=1001; StartTime=[datetime]::Today} |
>>     ForEach-Object {
>>         [PSCustomObject]@{
>>             TimeCreated = $_.TimeCreated
>>             ProviderName = $_.ProviderName
>>             EventID = $_.Id
>>             DiagnosticID = $_.Properties[2].Value
>>             Message = ($_.Properties[0].Value -join " ")
>>         }
>>     } | Format-Table -AutoSize

TimeCreated           ProviderName                               EventID DiagnosticID                         Message
-----------           ------------                               ------- ------------                         -------
5/23/2024 10:00:38 PM Microsoft-Windows-WER-SystemErrorReporting    1001 fed6e402-0998-4987-a650-fda41d8ca074 0x0000000a (0x000000000000004c, 0x0000000000000002, 0x0000000000000001, 0xfffff8007f90b8ba)
5/23/2024 8:42:59 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 1122c493-02e4-4ad6-95a4-ceae345a1f2d 0x0000000a (0x0000029985891047, 0x0000000000000002, 0x0000000000000001, 0xfffff8063d30b8ba)
5/23/2024 8:26:54 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 9bdd89a7-57a3-435f-8854-c350170bcd70 0x0000003b (0x00000000c0000005, 0xfffff80134d0b8ba, 0xffffe60059073900, 0x0000000000000000)
5/23/2024 8:23:22 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 e5b56d98-e475-4976-8c5c-692334d986c1 0x0000000a (0x000000000000004c, 0x0000000000000002, 0x0000000000000001, 0xfffff8040cf0b8ba)
5/23/2024 8:04:51 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 bf75bc45-200e-407e-8611-02f55d16a4db 0x0000001e (0xffffffffc0000005, 0xfffff8041b828777, 0x0000000000000000, 0xffffffffffffffff)
5/23/2024 8:02:59 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 3aaf6864-d78e-4c3f-9ad6-001b6e2552c6 0x0000000a (0x000000000000004c, 0x0000000000000002, 0x0000000000000001, 0xfffff8064a50b8ba)
5/23/2024 7:54:08 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 f1dee8f5-52da-4899-a782-56e1aac43847 0x0000000a (0x00000000005fe047, 0x0000000000000002, 0x0000000000000001, 0xfffff80738f0b8ba)
5/23/2024 7:49:59 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 c96ec495-4215-4def-903b-48262b4ef468 0x0000003b (0x00000000c0000005, 0xfffff8052830b8ba, 0xffffd380c81f3900, 0x0000000000000000)

Diagnostic ID                        Explanation
-------------                        -----------
fed6e402-0998-4987-a650-fda41d8ca074 0x0000000a: IRQL_NOT_LESS_OR_EQUAL (Memory access violation)
1122c493-02e4-4ad6-95a4-ceae345a1f2d 0x0000000a: IRQL_NOT_LESS_OR_EQUAL (Memory access violation)
9bdd89a7-57a3-435f-8854-c350170bcd70 0x0000003b: SYSTEM_SERVICE_EXCEPTION (General system error)
e5b56d98-e475-4976-8c5c-692334d986c1 0x0000000a: IRQL_NOT_LESS_OR_EQUAL (Memory access violation)
bf75bc45-200e-407e-8611-02f55d16a4db 0x0000001e: KMODE_EXCEPTION_NOT_HANDLED (Kernel mode error)
3aaf6864-d78e-4c3f-9ad6-001b6e2552c6 0x0000000a: IRQL_NOT_LESS_OR_EQUAL (Memory access violation)
f1dee8f5-52da-4899-a782-56e1aac43847 0x0000000a: IRQL_NOT_LESS_OR_EQUAL (Memory access violation)
c96ec495-4215-4def-903b-48262b4ef468 0x0000003b: SYSTEM_SERVICE_EXCEPTION (General system error)
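
To check whether these crashes share a faulting location, here is a rough Python sketch that pulls the faulting instruction address out of each event message and compares only its low 20 bits. The parameter positions follow Microsoft's bug check reference (parameter 4 for 0x0A, parameter 2 for 0x1E and 0x3B). Treating the low 20 bits as stable across reboots is my assumption about how the kernel image is relocated, so this is a heuristic and not a substitute for resolving symbols. The names `FAULT_ADDR_PARAM`, `MESSAGES`, and `fault_offset` are made up for illustration.

```python
import re
from collections import Counter

# 0-based index of the bug check parameter holding the faulting
# instruction address, per the bug check reference.
FAULT_ADDR_PARAM = {
    0x0A: 3,  # IRQL_NOT_LESS_OR_EQUAL: parameter 4
    0x1E: 1,  # KMODE_EXCEPTION_NOT_HANDLED: parameter 2
    0x3B: 1,  # SYSTEM_SERVICE_EXCEPTION: parameter 2
}

# Message column copied from the event log output in this report.
MESSAGES = [
    "0x0000000a (0x000000000000004c, 0x0000000000000002, 0x0000000000000001, 0xfffff8007f90b8ba)",
    "0x0000000a (0x0000029985891047, 0x0000000000000002, 0x0000000000000001, 0xfffff8063d30b8ba)",
    "0x0000003b (0x00000000c0000005, 0xfffff80134d0b8ba, 0xffffe60059073900, 0x0000000000000000)",
    "0x0000000a (0x000000000000004c, 0x0000000000000002, 0x0000000000000001, 0xfffff8040cf0b8ba)",
    "0x0000001e (0xffffffffc0000005, 0xfffff8041b828777, 0x0000000000000000, 0xffffffffffffffff)",
    "0x0000000a (0x000000000000004c, 0x0000000000000002, 0x0000000000000001, 0xfffff8064a50b8ba)",
    "0x0000000a (0x00000000005fe047, 0x0000000000000002, 0x0000000000000001, 0xfffff80738f0b8ba)",
    "0x0000003b (0x00000000c0000005, 0xfffff8052830b8ba, 0xffffd380c81f3900, 0x0000000000000000)",
]


def fault_offset(message, mask=0xFFFFF):
    """Return (bug check code, low bits of the faulting instruction address)."""
    values = [int(h, 16) for h in re.findall(r"0x[0-9a-fA-F]+", message)]
    code, params = values[0], values[1:]
    return code, params[FAULT_ADDR_PARAM[code]] & mask


# Count how many crashes land on each low-bits offset.
offsets = Counter(fault_offset(m)[1] for m in MESSAGES)
for offset, count in offsets.most_common():
    print(f"{offset:#07x}: {count} crash(es)")
```

Seven of the eight events end in `0b8ba`, and the one stack trace in this report shows the same address (`fffff800`7f90b8ba`) as the return address into the frame labeled `nt!ExTryAcquireSpinLockExclusiveAtDpcLevel+0x3a`. That frame name comes from export-only symbols, so it may not be the real function.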
STACK_TEXT:
ffff9009`8a89e368 fffff800`7fa33a69 : 00000000`0000000a 00000000`0000004c 00000000`00000002 00000000`00000001 : nt!KeBugCheckEx
ffff9009`8a89e370 fffff800`7fa2f24c : ffff9009`8a89e800 00000000`00000000 ffff940c`7c15e478 fffff800`7f86526c : nt!setjmpex+0x9269
ffff9009`8a89e4b0 fffff800`7f90b8ba : ffffa98e`5a058a20 fffff800`7f84faeb 00000000`00000000 00000000`00000024 : nt!setjmpex+0x4a4c
ffff9009`8a89e640 fffff800`7f96394a : 00000000`00000000 ffff9009`00000000 00000000`00000000 ffffa98e`5a058a80 : nt!ExTryAcquireSpinLockExclusiveAtDpcLevel+0x3a
ffff9009`8a89e670 fffff800`7f963897 : 00000000`00000000 00000000`00000008 ffffa98e`77664520 ffff9009`8a89ea90 : nt!FsRtlChangeBackingFileObject+0xca
ffff9009`8a89e6b0 fffff800`81ea8c95 : 00000000`00000000 00000000`00000000 ffffa98e`5a0581b0 ffff9009`8a89ea90 : nt!FsRtlChangeBackingFileObject+0x17
ffff9009`8a89e6e0 fffff800`81ea2792 : ffffa98e`757a4010 ffff9009`8a89ea90 ffffa98e`757a4010 00000000`00000000 : Ntfs+0xe8c95
ffff9009`8a89e980 fffff800`7f9031f5 : ffffa98e`5a058030 ffffa98e`757a4010 ffff9009`8a89ec00 ffffa98e`77664520 : Ntfs+0xe2792
ffff9009`8a89ec00 fffff800`7b7767df : ffffa98e`77664500 ffff9009`8a89ecf0 ffff9009`8a89ecf9 fffff800`7b775463 : nt!IofCallDriver+0x55
ffff9009`8a89ec40 fffff800`7b7a95e4 : ffff9009`8a89ecf0 ffffa98e`757a43f8 ffffa98e`59c7fd20 00000000`00000000 : FLTMGR!FltIsCallbackDataDirty+0x40f
ffff9009`8a89ecb0 fffff800`7f9031f5 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : FLTMGR!FltQueryInformationFile+0x9c4
ffff9009`8a89ed60 fffff800`7fcff276 : ffffa98e`757a4440 00000000`00000000 ffff9009`8a89f001 00000000`00001040 : nt!IofCallDriver+0x55
ffff9009`8a89eda0 fffff800`7fd98887 : 00000000`00000000 ffffa98e`5e933a20 a98e6562`4490d2bd ffffa98e`656244c0 : nt!SePrivilegeCheck+0x1a76
ffff9009`8a89ef60 fffff800`7fc5b215 : fffff800`7fd987c0 ffff9009`8a89f0d0 ffffa98e`551f9400 ffffa98e`656244c0 : nt!NtSetSecurityObject+0xab7
ffff9009`8a89efd0 fffff800`7fc5a6b1 : 00000000`00000000 ffff9009`8a89f200 00000000`00001040 ffffa98e`551f9400 : nt!ObOpenObjectByNameEx+0xd55
ffff9009`8a89f170 fffff800`7fcd0cc1 : 00000000`00000000 00000000`00000000 ffffa98e`5e933a20 000000c0`02299d58 : nt!ObOpenObjectByNameEx+0x1f1
ffff9009`8a89f2a0 fffff800`7fcd0469 : 000000c0`02299d10 00000000`00100080 000000c0`02299d58 000000c0`02299d20 : nt!NtCreateFile+0x8d1
ffff9009`8a89f360 fffff800`7fa33185 : 00000000`00000126 000000c0`00bc56c0 000000c0`00680000 00000000`00000000 : nt!NtCreateFile+0x79
ffff9009`8a89f3f0 00007ff8`5005ff14 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!setjmpex+0x8985
00000043`8c3ffaf8 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007ff8`5005ff14

If you tell me how to invoke kd in a way that downloads symbols I can show the whole memory dump.
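
On the symbols question, one possible approach is to point kd at the public Microsoft symbol server with `-y` and run `!analyze -v` against the dump. The Python sketch below only builds that command line. It assumes `kd.exe` from the Debugging Tools for Windows is on `PATH`, and the dump path and symbol cache directory are placeholders I picked, not values from this machine. `kd_analyze_command` is a hypothetical helper name.

```python
def kd_analyze_command(dump_path, symbol_cache):
    """Build a kd command line that loads a dump, fetches symbols, and analyzes it."""
    # "srv*<cache>*<server>" makes the debugger download symbols from the
    # server on demand and store them in the local cache directory.
    symbol_path = f"srv*{symbol_cache}*https://msdl.microsoft.com/download/symbols"
    return [
        "kd.exe",               # assumed to be on PATH
        "-z", dump_path,        # open a crash dump file
        "-y", symbol_path,      # symbol search path
        "-c", "!analyze -v; q", # run the analysis, then quit
    ]


# Placeholder paths; adjust to the actual dump and cache locations.
command = kd_analyze_command(r"C:\Windows\MEMORY.DMP", r"C:\symbols")
print(command)
# To actually run it on the host:
#   import subprocess; subprocess.run(command, check=True)
```

With symbols loaded, the `nt!setjmpex+...` and `Ntfs+...` frames in the stack above should resolve to real function names.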

To Reproduce

This deployment will cause a blue screen:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: x
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: x
  template:
    metadata:
      labels:
        app: x
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
        - name: some-container
          image: mcr.microsoft.com/windows/servercore:10.0.20348.2340
          securityContext:
            windowsOptions:
              runAsUserName: ContainerAdministrator
          command:
            - "C:/Windows/System32/WindowsPowerShell/v1.0/powershell.exe"
            - "-Command"
          args:
            - |
              $test = "1"

on a Windows 10.0.20348.1726 node.

On Windows 2022 10.0.20348.2227 + Docker, this does not reproduce.

Expected behavior
It shouldn't crash.

Configuration:

  • Edition: Windows Server 2022 Datacenter
  • Base Image being used: Windows Server Core
  • Container engine: containerd 1.7.16 (same behavior as with 1.7.0)
  • Kubernetes 1.26.2

Additional context
I am using this version of Windows due to projectcalico/calico#8529 and cannot update until it is fixed.

Just pointing out that this appears to be an issue when using older WS host and new WS image layers. I missed that the first time I read through it.

In our case we are seeing BSOD with Windows Server 2022 Worker Nodes on build 10.0.20348.2461 running a PowerShell image built from 10.0.20348.2322 (the tag below corresponds to sha256:fef9ce2b93ad3b09bd51f60bba3476fafd4d9dc46260de9aff6e5aff4bd142f5).

This implies the issue also affects hosts whose WS build version is newer than the image's WS build version!
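
To make the two reported combinations explicit, here is a small Python sketch that classifies the skew between a host build and an image build. The helper names `parse_build` and `classify_skew` are made up for illustration, and the 10.0.20348.2322 image build is the one stated in this comment.

```python
def parse_build(version):
    """Split a version like '10.0.20348.2340' into four integers."""
    parts = tuple(int(p) for p in version.split("."))
    if len(parts) != 4:
        raise ValueError(f"expected major.minor.build.revision, got {version!r}")
    return parts


def classify_skew(host, image):
    """Describe how the image build relates to the host build."""
    h, i = parse_build(host), parse_build(image)
    if h[:3] != i[:3]:
        return "different release"
    if i[3] > h[3]:
        return "image newer than host"
    if i[3] < h[3]:
        return "host newer than image"
    return "matched"


# The two combinations reported in this thread.
print(classify_skew("10.0.20348.1726", "10.0.20348.2340"))
print(classify_skew("10.0.20348.2461", "10.0.20348.2322"))
```

Both directions of skew appear in the reports, so pinning only the image or only the host would not necessarily avoid the crash.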

To reproduce this, on a worker node that isn't running any other containers (this somehow makes a BSOD more likely), create a pod with the following container spec:

      containers:
      - image: mcr.microsoft.com/powershell:lts-nanoserver-ltsc2022
        imagePullPolicy: IfNotPresent
        command: ["pwsh.exe"]
        args: ["-noprofile", "-noninteractive", "-executionpolicy", "bypass", "-command", "[System.Threading.Thread]::Sleep([System.Threading.Timeout]::Infinite)"]
        name: node-ds
        resources:
          limits:
            cpu: 1
            ephemeral-storage: 512M
            memory: 1G
          requests:
            cpu: 250m
            ephemeral-storage: 128M
            memory: 512M
        securityContext:
          runAsNonRoot: true

The worker node does not crash immediately, but only after a few days.

@doctorpangloss and @avin3sh, I wonder if you could share the crash dump files with me. Thanks!

Please advise how to send the crash dumps. I appreciate you investigating this issue.

I have the following instructions for sharing a large file through Azure storage. Please let me know if this approach works for you:

How to Use Azure Blob Storage (1).docx

Thank you!

Thanks for this. Memory dumps from this machine will contain sensitive information. I have an AWS presigned URL I can share. What's the best way to send it to you?

Thank you for sharing the dump file. I’ll keep you posted.