aws/amazon-cloudwatch-agent

Agent exits with code 1 instead of panicking when the configuration validation phase fails

rawahars opened this issue · 3 comments

Describe the bug
We install the Amazon CloudWatch agent on Windows hosts as specified in the Amazon CloudWatch docs, using the following command:

& "C:\Program Files\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1" -a fetch-config -m ec2 -s -c file:configuration-file-path

This script registers the CloudWatch agent as a Windows service. Ideally, whenever the agent crashes, the Windows Service Manager (WSM) should restart it. We assume that was the original intention, and it works if the agent actually does crash.

In our use case, we run the agent on an EC2 instance, with the region being used in the agent's config. However, when the instance boots, IMDS is unavailable for a few reasons. This causes the agent to assume that it is running in an OnPrem environment, and it therefore exits with code 1.

Since the agent stops with code 1, WSM assumes that the application stopped by itself and therefore never restarts it. We think the correct action would be for the agent to exit with a panic whenever there is a non-recoverable failure.

The logs we see are:

2023-09-21T05:56:23Z D! cloudwatch: publish routine receives the shutdown signal, exiting.
2023/09/21 16:08:10 I! D! [EC2] Found active network interface
E! [EC2] Cannot get EC2 Metadata from IMDS: EC2 metadata is not available.
I! Detected the instance is OnPremise
2023/09/21 16:08:10 Reading json config file path: C:\ProgramData\Amazon\AmazonCloudWatchAgent\\amazon-cloudwatch-agent.json ...
C:\ProgramData\Amazon\AmazonCloudWatchAgent\\amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
2023/09/21 16:08:10 Reading json config file path: C:\ProgramData\Amazon\AmazonCloudWatchAgent\Configs\file_config.json ...
2023/09/21 16:08:10 I! Valid Json input schema.
Got Home directory: C:\Users\Administrator
I! Set home dir windows: C:\Users\Administrator
I! SDKRegionWithCredsMap region:  
Got Home directory: C:\Users\Administrator
2023/09/21 16:08:10 E! Failed to generate TOML configuration validation content: [Under path : /agent/ruleRegion/ | Error : Region info is missing for mode: onPrem]
2023/09/21 16:08:10 E! Failed to generate TOML configuration validation content: [Under path : /agent/ruleRegion/ | Error : Region info is missing for mode: onPrem]
2023/09/21 16:08:10 Under path : /agent/ruleRegion/ | Error : Region info is missing for mode: onPrem
2023/09/21 16:08:10 Configuration validation first phase failed. Agent version: 1.0. Verify the JSON input is only using features supported by this version.
 
2023/09/21 16:08:10 I! Return exit error: exit code=1
2023/09/21 16:08:10 E! Cannot translate JSON, ERROR is exit status 1 
2023/09/21 16:09:21 I! D! [EC2] Found active network interface
E! [EC2] Cannot get EC2 Metadata from IMDS: EC2 metadata is not available.
I! Detected the instance is OnPremise
2023/09/21 16:09:21 Reading json config file path: C:\ProgramData\Amazon\AmazonCloudWatchAgent\\amazon-cloudwatch-agent.json ...
C:\ProgramData\Amazon\AmazonCloudWatchAgent\\amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
2023/09/21 16:09:21 Reading json config file path: C:\ProgramData\Amazon\AmazonCloudWatchAgent\Configs\file_config.json ...
2023/09/21 16:09:21 I! Valid Json input schema.
Got Home directory: C:\Users\Administrator
I! Set home dir windows: C:\Users\Administrator
I! SDKRegionWithCredsMap region:  
Got Home directory: C:\Users\Administrator
2023/09/21 16:09:21 E! Failed to generate TOML configuration validation content: [Under path : /agent/ruleRegion/ | Error : Region info is missing for mode: onPrem]
2023/09/21 16:09:21 E! Failed to generate TOML configuration validation content: [Under path : /agent/ruleRegion/ | Error : Region info is missing for mode: onPrem]
2023/09/21 16:09:21 Under path : /agent/ruleRegion/ | Error : Region info is missing for mode: onPrem
2023/09/21 16:09:21 Configuration validation first phase failed. Agent version: 1.0. Verify the JSON input is only using features supported by this version.
 
2023/09/21 16:09:21 I! Return exit error: exit code=1
2023/09/21 16:09:21 E! Cannot translate JSON, ERROR is exit status 1 
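
The two attempts in the log above (16:08:10 and 16:09:21) fail the same way. As a rough illustration of how these failures could be spotted automatically, here is a minimal Python sketch that scans log text for the marker lines shown above. The function names are hypothetical and not part of the agent; the marker strings are copied from the log excerpt, and the timestamp pattern assumes the `YYYY/MM/DD HH:MM:SS` prefix seen there.

```python
import re

# Marker strings copied verbatim from the log excerpt in this issue.
FAILURE_MARKER = "Configuration validation first phase failed"
ONPREM_MARKER = "Region info is missing for mode: onPrem"

# Assumes the "2023/09/21 16:08:10 " style prefix seen in the excerpt.
TIMESTAMP = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def failed_start_attempts(log_text):
    """Return the timestamp of every failed validation phase in log_text."""
    attempts = []
    for line in log_text.splitlines():
        match = TIMESTAMP.match(line)
        if match and FAILURE_MARKER in line:
            attempts.append(match.group(1))
    return attempts


def caused_by_missing_region(log_text):
    """Return True when the log contains the onPrem region error."""
    return ONPREM_MARKER in log_text
```

A non-empty result from `failed_start_attempts` combined with `caused_by_missing_region` returning True would match the failure described in this report.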

Steps to reproduce

  • Launch a Windows EC2 instance
  • Delete the IMDS route using the following command:
route delete 169.254.169.254 mask 255.255.255.255
  • Install the Amazon CloudWatch agent on the instance as specified above
  • Start the service with Start-Service AmazonCloudWatchAgent
  • The logs show the statements above
  • The CloudWatch agent is not restarted

What did you expect to see?
We expected that Windows Service Manager would try to restart the CloudWatch Agent service.

What did you see instead?
We saw in the CloudWatch Agent logs that the agent never restarted.

What version did you use?
Version:

What config did you use?

{
  "agent": {
    "debug": true
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "C:\\ProgramData\\containerd\\root\\panic.log*",
            "log_group_name": "containerd",
            "log_stream_name": "{instance_id}/containerd-daemon-panic",
            "timezone": "UTC"
          }
        ]
      }
    }
  },
  "metrics": {
    "namespace": "Default",
    "append_dimensions": {
      "ImageId": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}"
    },
    "aggregation_dimensions": [
      [
        "InstanceId"
      ],
      []
    ],
    "metrics_collected": {
      "LogicalDisk": {
        "measurement": [
          {
            "name": "% Free Space",
            "unit": "Percent"
          }
        ],
        "resources": [
          "/",
          "C:\\ProgramData\\containerd"
        ]
      },
      "Memory": {
        "measurement": [
          {
            "name": "Available MBytes",
            "unit": "Megabytes"
          }
        ]
      },
      "statsd": {
        "metrics_aggregation_interval": 30,
        "metrics_collection_interval": 10,
        "service_address": ":8125"
      },
      "procstat": [
        {
          "exe": "containerd",
          "measurement": [
            "cpu_usage",
            "memory_rss"
          ]
        }
      ]
    }
  }
}
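
The config above has no `region` field in its `agent` section, and the log shows `Region info is missing for mode: onPrem`. The following Python sketch shows one way to add an explicit region to such a config before handing it to the agent. It rests on an assumption not confirmed in this thread: that an explicit `agent.region` value lets the region check pass without IMDS. The function name and the region value `us-west-2` are placeholders. Placeholders such as `${aws:InstanceId}` and `{instance_id}` presumably still depend on instance metadata, so this would not remove the IMDS dependency entirely.

```python
import copy
import json


def with_explicit_region(config, region):
    """Return a copy of the agent config with agent.region set to region.

    Assumption: an explicit agent.region satisfies the translator's region
    check. The input config is left unmodified.
    """
    updated = copy.deepcopy(config)
    updated.setdefault("agent", {})["region"] = region
    return updated


if __name__ == "__main__":
    original = {"agent": {"debug": True}}
    # "us-west-2" is a placeholder; the issue does not state the region.
    print(json.dumps(with_explicit_region(original, "us-west-2"), indent=2))
```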

Environment
OS: Windows Server 2019 and Windows Server 2022

Hi @rawahars,

Thanks for reporting this issue. One workaround for the delayed IMDS availability at startup is to set the newly available imds_retries section (see #803 (comment)), which can potentially allow the agent to retry during startup until IMDS is up.
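
To illustrate the general idea of retrying until IMDS is up, here is a minimal Python sketch of an external wait loop that could run before the service is started. It is not how `imds_retries` is implemented inside the agent. The helper names, the attempt count, and the delays are assumptions chosen for illustration, and the probe only checks TCP reachability of the link-local address used in the reproduction steps.

```python
import socket
import time

# Link-local IMDS address from the reproduction steps; port 80 is HTTP.
IMDS_ADDRESS = ("169.254.169.254", 80)


def imds_reachable(timeout=1.0):
    """Return True if a TCP connection to the IMDS endpoint succeeds."""
    try:
        with socket.create_connection(IMDS_ADDRESS, timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(probe, attempts=5, delay=1.0, sleep=time.sleep):
    """Call probe up to `attempts` times, doubling the delay between tries.

    Returns True as soon as probe() is truthy, False if every attempt fails.
    The sleep function is injectable so the loop can be tested without waiting.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2
    return False
```

A startup script could call `wait_until(imds_reachable)` and run `Start-Service AmazonCloudWatchAgent` only when it returns True.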

Changing the translator to panic instead of exiting with an exit code of 1 is a behavior change that can potentially impact existing customers in unexpected ways.