aws/amazon-cloudwatch-agent

CloudWatchAgent on Windows fails with "imds retry client will retry 1 times"

YoungLee9853 opened this issue · 13 comments

Describe the bug
In EC2 UserData, I am trying to execute "& 'C:\Program Files\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1' -a fetch-config -m ec2 -s -c file:C:\tmp\cloud-watch-agent-config.json" to install the cloud watch agent configuration on Windows Server 2022 host on start up. The UserData fails with "PS>TerminatingError(config-downloader.exe): "The running command stopped because the preference variable "ErrorActionPreference" or common parameter is set to Stop: 2023/09/29 01:07:56 I! imds retry client will retry 1 times" message.

Steps to reproduce

  1. Launch new Windows Server EC2 host with UserData including "& 'C:\Program Files\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1' -a fetch-config -m ec2 -s -c file:C:\tmp\cloud-watch-agent-config.json""
  2. Wait for the user data script to fail...
  • To see the exact failure message, you will need to trap the exception and keep the transcript.

What did you expect to see?
AmazonCloudWatchAgent boots up with the provided configuration.

What did you see instead?
User data script fails.

What version did you use?
Version: (e.g., v1.247350.0, etc)

What config did you use?
Config: (e.g. the agent json config file)

Environment
Windows Server 2022

Additional context
Add any other context about the problem here.

Looking at previous issues, it seems like the issue is related to #516 .

Related - https://github.com/aws/amazon-cloudwatch-agent/blob/main/internal/retryer/imdsretryer.go#L30

Can you post your ami. We have tests for win-2022 where the agent does start.

Same here. This worked last week.

AMI ID - ami-00c896faf296575ab

$config = @"
{
    "logs": {
        "logs_collected": {
            "windows_events": {
                "collect_list": [
                    {
                        "event_format": "xml",
                        "event_levels": [
                            "VERBOSE",
                            "INFORMATION",
                            "WARNING",
                            "ERROR",
                            "CRITICAL"
                        ],
                        "event_name": "Application",
                        "log_group_name": "/my/logs",
                        "log_stream_name": "{instance_id}/event-logs/application"
                    }
                ]
            }
        }
    }
}
"@

$installDirectory = "c:\temp\cw"
$downloadDirectory = $installDirectory 
$logsDirectory = $installDirectory 
    
New-Item -ItemType "directory" -Path $installDirectory

Set-Location -Path $installDirectory

$config | Set-Content -Path "$installDirectory/config.json"

Write-host "Installing Cloudwatch Agent"
$cwAgentInstaller = "$downloadDirectory\amazon-cloudwatch-agent.msi"
$cwAgentInstallPath = "C:\Program Files\Amazon\AmazonCloudWatchAgent"
Invoke-WebRequest "https://s3.amazonaws.com/amazoncloudwatch-agent/windows/amd64/latest/amazon-cloudwatch-agent.msi" -OutFile $cwAgentInstaller
Start-Process -FilePath msiexec -Args "/i $cwAgentInstaller /l*v $logsDirectory\installCWAgentLog.log /qn" -Verb RunAs -Wait

Write-host "Load config"
& "$cwAgentInstallPath\amazon-cloudwatch-agent-ctl.ps1" -a fetch-config -m ec2 -s -c file:"$installDirectory/config.json"

Output:

Load config
****** processing amazon-cloudwatch-agent ******
I! Trying to detect region from ec2
config-downloader.exe : 2023/10/02 11:52:44 I! imds retry client will retry 1 times
At C:\Program Files\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1:304 char:9
+         & $CWAProgramFiles\config-downloader.exe --output-dir "${JSON ...
+         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (2023/10/02 11:5...l retry 1 times:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
4wuyan commented

I am also seeing the same.

I noticed:

  1. C:\'Program Files'\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1 -a fetch-config -m ec2 -c file:\temp\cloudwatch-agent-config.json works fine when I later connect to the EC2 and run this command manually. But it will fail when running as a part of user data.
  2. It only starts to happen this week. The same AMI that used to be fine last week is not ok this week. When I inspect the good EC2 from last week, I find the cloudwatch agent downloaded from https://s3.amazonaws.com/amazoncloudwatch-agent/windows/amd64/latest/amazon-cloudwatch-agent.msi in my user data script is a different version.

Last week, https://s3.amazonaws.com/amazoncloudwatch-agent/windows/amd64/latest/amazon-cloudwatch-agent.msi returns Amazon CloudWatch Agent 1.300026.3 (2023-08-21), while this week, it's Amazon CloudWatch Agent 1.300028.1 (2023-09-18). Hence, I suspect it's an issue related to the latest version(s) of cloudwatch agent.

PS C:\Program Files\Amazon\AmazonCloudWatchAgent> cat .\CWAGENT_VERSION
1.300026.3b189

PS C:\Program Files\Amazon\AmazonCloudWatchAgent> cat .\RELEASE_NOTES -head 20
========================================================================
Amazon CloudWatch Agent 1.300026.3 (2023-08-21)
========================================================================

Bug fixes:
* Fix credential chain for new components
* Fix metric renaming for Windows Performance Counters
* Fix log stream name translation for EMF on ECS
* Reduce RPM installation time

========================================================================
Amazon CloudWatch Agent 1.300026.2 (2023-08-10)
========================================================================

Bug fixes:
* Fix EMF log corruption when multiple clients are sending concurrently
* Drop invalid EMF logs
* Allow environment variables in OTEL config
* Revert credential chain when running as a service to prioritize instance role
PS C:\Program Files\Amazon\AmazonCloudWatchAgent> cat .\CWAGENT_VERSION
1.300028.1b210

PS C:\Program Files\Amazon\AmazonCloudWatchAgent> cat .\RELEASE_NOTES -head 20
========================================================================
Amazon CloudWatch Agent 1.300028.1 (2023-09-18)
========================================================================

Bug fixes:
* Fix windows event logs to start only once

========================================================================
Amazon CloudWatch Agent 1.300028.0 (2023-09-11)
========================================================================

Bug fixes:
* Fix file pattern matching to support glob wildcard characters (!{})
* Use LogStreamName instead of ServiceName in token replacement for Prometheus
* Add fallback shared config files for credential ordering to maintain previous AWS SDK behavior
* Drop unsupported NaN, Inf, and out of range values

Enhancements:
* Try using IMDSv2 only first before using client with fallback
* Add support for configurable IMDS retries in the common-config.toml

References:

I put this together to demonstrate the problem and assist with troubleshooting.

https://github.com/ryanwilliams83/CloudWatchAgent-871

image

MSI Package Version 1.4.37882 (https://github.com/ryanwilliams83/CloudWatchAgent-871/raw/main/assets/amazon-cloudwatch-agent-1.4.37882.msi)
image

MSI Package Version 1.4.37884 + latest
image

@AllanBenson001

Is this issue only when running this command in user data or when running in powershell. I ran this command in powershell after agent started and it worked. I want to make this issue only happens when starting with user data.

Okay so I was able to reproduce and fix the issue. Can you please take this version of the agent until we finish the patch release.

Closing the issue since root cause is found, the issue is mitigated.

4wuyan commented

Thanks for the effort! Glad to see it's fixed.

BTW, for those who are interested in why it happens only in user data execution, but not in interactive powershell:

I believe it's a long lasting issue when powershell redirects error stream.

User data is executed in this way, with both 1> and 2>:

C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -Command {
  $env:EC2Launch_Execution_Mode = 'attached';
  . 'C:\Windows\system32\config\systemprofile\AppData\Local\Temp\EC2Launch123\UserScript.ps1' 1> 'C:\Windows\system32\config\systemprofile\AppData\Local\Temp\EC2Launch123\output.tmp' 2> 'C:\Windows\system32\config\systemprofile\AppData\Local\Temp\EC2Launch123\err.tmp';
  exit $LASTEXITCODE
}

And I do find if I manually run it in powershell with 2>, it fails too.

PS C:\temp> C:\'Program Files'\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1 -a fetch-config -m ec2 -c file:\temp\cloudwatch-agent-config.json 2>tmp
****** processing amazon-cloudwatch-agent ******
I! Trying to detect region from ec2
D! [EC2] Found active network interface
Successfully fetched the config and saved in C:\ProgramData\Amazon\AmazonCloudWatchAgent\Configs\file_cloudwatch-agent-config.json.tmp
config-downloader.exe : 2023/10/05 12:41:18 I! imds retry client will retry 1 times
At C:\Program Files\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1:304 char:9
+         & $CWAProgramFiles\config-downloader.exe --output-dir "${JSON ...
+         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (2023/10/05 12:4...l retry 1 times:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

When will this fix be released so that I can start using the latest version again?

ymtaye commented

The fix for this issue is currently released in the latest CloudWatch Agent, please try retrieving it by using the link below. Thanks!
https://amazoncloudwatch-agent.s3.amazonaws.com/windows/amd64/latest/amazon-cloudwatch-agent.msi