Azure/Guest-Configuration-Extension

On long running VM sometimes guest assignments disappear from Azure portal, gc worker shuts down and never resumes

Opened this issue · 2 comments

I'm not 100% sure what causes this, but we've seen it in quite a few Azure virtual machines.

Something seems to be causing gc worker to think that there is nothing left to do and it just stops. All of the existing guest assignments eventually disappear from Azure portal because gc worker is no longer reporting compliance and the guest assignments expire. Even the 'AzureLinuxBaseline' guest assignments will disappear.

Here's what the tail end of the gc_worker.log looks like when this condition gets triggered:

[2024-04-23 12:52:04.935] [PID 3254671] [TID 3254672] [GC_OPERATIONS] [INFO] [00000000-0000-0000-0000-000000000000] Agent is waiting for running operations to finish. Current active operation count is 0.
[2024-04-23 12:52:04.935] [PID 3254671] [TID 3254672] [GC_OPERATIONS] [INFO] [00000000-0000-0000-0000-000000000000] All the active operations are finished.
[2024-04-23 12:52:04.935] [PID 3254671] [TID 3254672] [CONSISTENCY_OPERATIONS] [INFO] [00000000-0000-0000-0000-000000000000] Deleting consistency operation context.
[2024-04-23 12:52:04.935] [PID 3254671] [TID 3254672] [CONSISTENCY_OPERATIONS] [INFO] [00000000-0000-0000-0000-000000000000] Consistency operation context deleted successfully.
[2024-04-23 12:52:04.935] [PID 3254671] [TID 3254672] [GCCACHE_OPERATIONS] [INFO] [00000000-0000-0000-0000-000000000000] gc cache context deleted successfully.
[2024-04-23 12:52:05.235] [PID 3254671] [TID 3254671] [PSPROVIDER] [INFO] [00000000-0000-0000-0000-000000000000] Cleanup(). shutdownCoreClr() is successful

When this happens it is also impossible to restart gcd.service. it hangs, waiting for a process forever.

# systemctl status gcd
● gcd.service - GC Service
     Loaded: loaded (/lib/systemd/system/gcd.service; enabled; vendor preset: enabled)
     Active: deactivating (stop-sigterm) since Tue 2024-05-07 20:02:33 UTC; 6s ago
   Main PID: 919448 (bash)
      Tasks: 43 (limit: 19179)
     Memory: 271.0M
     CGroup: /system.slice/gcd.service
             ├─919448 /bin/bash /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.26.64/GCAgent/GC/run_service.sh
             └─919449 /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.26.64/GCAgent/GC/gc_linux_service

May 07 20:02:33 vm-dtb-aid-sg2-pr-02 systemd[1]: Stopping GC Service...
May 07 20:02:33 vm-dtb-aid-sg2-pr-02 bash[919448]: Stopping service process 919449
May 07 20:02:33 vm-dtb-aid-sg2-pr-02 bash[919448]: Waiting for service pid=919449
Warning: journal has been rotated since unit was started, output may be incomplete.

The only way I've found so far to recover from this, short of a reboot which we can't do whenever we want in a production environment, is to manually issue a kill on the process that it's waiting forever for.

I'm at a loss as to how to further troubleshoot this. If Microsoft would like more information on this or work with me to gather more information, I can make myself available.

In this specific instance the OS was Ubuntu 20.04 LTS (Azure marketplace image from Canonical)

I think I've found a workaround that might be feasible for us until this issue gets properly resolved -- creating an entry in /etc/cron.daily to restart gcd before it has a chance to reach the 'hung' state. So far that seems to have helped on the one VM where I've tried it.

I noticed that this also seems to happen when I delete guest assignments via Azure Portal, even if some guest assignments are remaining for the same virtual machine, and in theory the worker should still have work to do.

I see similar issues VERY OFTEN and have taken to restarting the Guest Configuration Extension Service (extensionservice) as a workaround, since this service is "hung" and not responding. As mentioned it will not shutdown in a timely fashion when remediating the issue.

In my situation I would add that these are ARC servers being patched with Azure Update Extension.

This happens with various Windows OSes from Windows 10 to Windows Server 22.

[1728+00000001] 9/10/2024 10:41:43 PM [Info] [ActivityId=00000000-0000-0000-0000-000000000000][MachineName=FLV-StratusCfg-W10] The extension is being called with valid argument, loggers are initialized. [Argument=Enable][ExtensionVersion=1.0.22.0][HandlerEnvironment=LogFolder: "C:\ProgramData\GuestConfig\extension_logs\Microsoft.SoftwareUpdateManagement.WindowsOsUpdateExtension", ConfigFolder: "C:\Packages\Plugins\Microsoft.SoftwareUpdateManagement.WindowsOsUpdateExtension\1.0.22.0\RuntimeSettings", StatusFolder: "C:\Packages\Plugins\Microsoft.SoftwareUpdateManagement.WindowsOsUpdateExtension\1.0.22.0\status", HeartbeatFile: "C:\Packages\Plugins\Microsoft.SoftwareUpdateManagement.WindowsOsUpdateExtension\1.0.22.0\status\HeartBeat.Json"][SequenceNumber=1]
[1728+00000001] 9/10/2024 10:41:45 PM [Info] [ActivityId=00000000-0000-0000-0000-000000000000][MachineName=FLV-StratusCfg-W10] The extension has read the input settings. [Argument=Enable][ExtensionVersion=1.0.22.0][SequenceNumber=1][InputSettings={
  "maximumDuration": 229,
  "classificationsToInclude": [
    "Critical",
    "Security",
    "UpdateRollup",
    "FeaturePack",
    "ServicePack",
    "Definition",
    "Tools",
    "Updates"
  ],
  "patchesToExclude": [
    "5034439",
    "5034441"
  ],
  "patchesToInclude": {
    "updates": [],
    "mode": 1
  },
  "rebootSettings": 1,
  "excludeKbsRequiringReboot": null,
  "maxPatchPublishDate": null,
  "skipInstallation": false,
  "activityId": "e0cd605f-ebc1-4b85-8be9-e8128693f61e",
  "startTime": "2024-09-09T23:05:25.506842Z",
  "action": "Patching"
}]
[1728+00000001] 9/10/2024 10:41:45 PM [Info] [ActivityId=e0cd605f-ebc1-4b85-8be9-e8128693f61e][MachineName=FLV-StratusCfg-W10] Enabling Handler: [SequenceNumber=1]
[1728+00000001] 9/10/2024 10:41:45 PM [Info] [ActivityId=e0cd605f-ebc1-4b85-8be9-e8128693f61e][MachineName=FLV-StratusCfg-W10] Checking if sequence number has changed. [SequenceNumber=1][HasSequenceNumberChanged=False]
[1728+00000001] 9/10/2024 10:41:46 PM [Info] [ActivityId=e0cd605f-ebc1-4b85-8be9-e8128693f61e][MachineName=FLV-StratusCfg-W10] Checking if previous job is running when the extension is enabled. [SequenceNumber=1][IsJobRunning=False]
[1728+00000001] 9/10/2024 10:41:46 PM [Info] [ActivityId=e0cd605f-ebc1-4b85-8be9-e8128693f61e][MachineName=FLV-StratusCfg-W10] Extension is enabled with previous sequence number and there is no current running job. Extension is exiting as no Operation. [SequenceNumber=1][PreviousJobStatuss=success]