aws/amazon-cloudwatch-agent

Sometimes at service startup logs agent gets permission denied

jfharden opened this issue · 12 comments

Describe the bug
Sometimes when starting up the logs agent fails to read files with permission denied. Restarting the service repeatedly will sometimes get permission denied errors, and sometimes run ok.

Steps to reproduce
I can't perfectly recreate but our setup is:

Using the amazon-linux-2 AMI amzn2-ami-hvm-2.0.20231020.1-x86_64-gp2 (currently ami-0a123353df8e77189 in eu-west-1). Launch an instance with the following user data:

#!/bin/bash

set -euo pipefail

export AWS_REGION=eu-west-1
export AWS_DEFAULT_REGION=eu-west-1

yum update --security -y

yum install amazon-cloudwatch-agent -y

# Add CW Agent to adm group and grant access to read logs
usermod -a -G adm cwagent
setfacl -d -m g:adm:r-x /var/log
setfacl -R -d -m g:adm:r-- /var/log/*
setfacl -m g:adm:r-x /var/log
setfacl -R -m g:adm:r-- /var/log/*

systemctl stop amazon-cloudwatch-agent

cat <<EOF > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/messages*",
            "log_group_name": "<REDACTED>",
            "log_stream_name": "{hostname}/messages",
            "timestamp_format": "%b %-d %H:%M:%S"
          },
          {
            "file_path": "/var/log/secure*",
            "log_group_name": "<REDACTED>",
            "log_stream_name": "{hostname}/secure",
            "timestamp_format": "%b %-d %H:%M:%S"
          },
          {
            "file_path": "/var/log/dmesg",
            "log_group_name": "<REDACTED>",
            "log_stream_name": "{hostname}/dmesg"
          },
          {
            "file_path": "/var/log/yum.log*",
            "log_group_name": "<REDACTED>",
            "log_stream_name": "{hostname}/yum",
            "timestamp_format": "%b %-d %H:%M:%S"
          }
        ]
      }
    }
  }
}
EOF

systemctl start amazon-cloudwatch-agent
systemctl enable amazon-cloudwatch-agent

# If we get here, we didn't fail, and can mark this machine as successfully provisioned!
INSTANCE_ID=$(ec2-metadata --instance-id | cut -d' ' -f2)
aws ec2 create-tags --resources "$INSTANCE_ID" --tags 'Key="provisioned",Value="true"'

echo "cloud-init provisioning complete - Rebooting"

reboot now

What did you expect to see?
The server startup and cloudwatch logs agent shipping logs

What did you see instead?
The logs agent sometimes receives permission denied on the log files:

2023-11-01T23:02:23Z I! Starting AmazonCloudWatchAgent CWAgent/1.300028.1 (go1.20.7; linux; amd64)
2023-11-01T23:02:23Z I! AWS SDK log level not set
2023-11-01T23:02:23Z I! creating new logs agent
2023-11-01T23:02:23Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"<REDACTED>.eu-west-1.compute.internal", Flush Interval:1s
2023-11-01T23:02:23Z I! [logagent] start logs plugin file paths [/var/log/messages* /var/log/secure* /var/log/dmesg /var/log/yum.log*]
2023-11-01T23:02:23Z I! [inputs.logfile] turned on logs plugin
2023-11-01T23:02:23Z I! [logagent] starting
2023-11-01T23:02:23Z I! [logagent] found plugin cloudwatchlogs is a log backend
2023-11-01T23:02:23Z I! [logagent] found plugin logfile is a log collection
2023-11-01T23:02:23Z I! [logagent] start logs plugin file paths [/var/log/messages* /var/log/secure* /var/log/dmesg /var/log/yum.log*]
2023-11-01T23:02:23Z I! [inputs.logfile] turned on logs plugin
2023-11-01T23:02:24Z I! [logagent] piping log from <REDACTED>/<REDACTED>.eu-west-1.compute.internal/messages(/var/log/messages) to cloudwatchlogs with retention -1
2023-11-01T23:02:24Z I! [logagent] piping log from<REDACTED>/<REDACTED>..compute.internal/secure(/var/log/secure) to cloudwatchlogs with retention -1
2023-11-01T23:02:24Z I! [logagent] piping log from <REDACTED>/<REDACTED>..eu-west-1.compute.internal/dmesg(/var/log/dmesg) to cloudwatchlogs with retention -1
2023-11-01T23:02:24Z I! [logagent] piping log from <REDACTED>/<REDACTED>..eu-west-1.compute.internal/yum(/var/log/yum.log) to cloudwatchlogs with retention -1
2023-11-01T23:02:30Z I! Profiler is stopped during shutdown
2023-11-01T23:02:30Z I! [agent] Hang on, flushing any cached metrics before shutdown
2023-11-01T23:02:30Z I! [agent] Stopping running outputs
2023/11/01 23:02:51 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2023/11/01 23:02:51 D! config [agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = ""
  interval = "60s"
  logfile = "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log"
  logtarget = "lumberjack"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = false

[inputs]

  [[inputs.logfile]]
    destination = "cloudwatchlogs"
    file_state_folder = "/opt/aws/amazon-cloudwatch-agent/logs/state"

    [[inputs.logfile.file_config]]
      file_path = "/var/log/messages*"
      from_beginning = true
      log_group_name = "<REDACTED>"
      log_stream_name = "<REDACTED>.eu-west-1.compute.internal/messages"
      pipe = false
      retention_in_days = -1
      timestamp_layout = "Jan 2 15:04:05"
      timestamp_regex = "(\\w{3} \\s{0,1}\\d{1,2} \\d{2}:\\d{2}:\\d{2})"

    [[inputs.logfile.file_config]]
      file_path = "/var/log/secure*"
      from_beginning = true
      log_group_name = "<REDACTED>"
      log_stream_name = "<REDACTED>.eu-west-1.compute.internal/secure"
      pipe = false
      retention_in_days = -1
      timestamp_layout = "Jan 2 15:04:05"
      timestamp_regex = "(\\w{3} \\s{0,1}\\d{1,2} \\d{2}:\\d{2}:\\d{2})"

    [[inputs.logfile.file_config]]
      file_path = "/var/log/dmesg"
      from_beginning = true
      log_group_name = "<REDACTED>"
      log_stream_name = "<REDACTED>.eu-west-1.compute.internal/dmesg"
      pipe = false
      retention_in_days = -1

    [[inputs.logfile.file_config]]
      file_path = "/var/log/yum.log*"
      from_beginning = true
      log_group_name = "<REDACTED>"
      log_stream_name = "<REDACTED>.eu-west-1.compute.internal/yum"
      pipe = false
      retention_in_days = -1
      timestamp_layout = "Jan 2 15:04:05"
      timestamp_regex = "(\\w{3} \\s{0,1}\\d{1,2} \\d{2}:\\d{2}:\\d{2})"

[outputs]

  [[outputs.cloudwatchlogs]]
    force_flush_interval = "5s"
    log_stream_name = "<REDACTED>"
    region = "eu-west-1"
2023/11/01 23:02:51 I! Config has been translated into YAML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.yaml
2023/11/01 23:02:51 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
2023/11/01 23:02:51 I! Valid Json input schema.
2023/11/01 23:02:51 I! Detected runAsUser: cwagent
2023/11/01 23:02:51 I! Changing ownership of [/opt/aws/amazon-cloudwatch-agent/logs /opt/aws/amazon-cloudwatch-agent/etc /opt/aws/amazon-cloudwatch-agent/var] to 995:993
2023/11/01 23:02:51 I! Set HOME: /home/cwagent
2023-11-01T23:02:52Z I! Starting AmazonCloudWatchAgent CWAgent/1.300028.1 (go1.20.7; linux; amd64)
2023-11-01T23:02:52Z I! AWS SDK log level not set
2023-11-01T23:02:52Z I! creating new logs agent
2023-11-01T23:02:52Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-172-18-100-24.eu-west-1.compute.internal", Flush Interval:1s
2023-11-01T23:02:52Z I! [logagent] start logs plugin file paths [/var/log/messages* /var/log/secure* /var/log/dmesg /var/log/yum.log*]
2023-11-01T23:02:52Z I! [inputs.logfile] turned on logs plugin
2023-11-01T23:02:52Z I! [logagent] starting
2023-11-01T23:02:52Z I! [logagent] found plugin cloudwatchlogs is a log backend
2023-11-01T23:02:52Z I! [logagent] found plugin logfile is a log collection
2023-11-01T23:02:52Z I! [logagent] start logs plugin file paths [/var/log/messages* /var/log/secure* /var/log/dmesg /var/log/yum.log*]
2023-11-01T23:02:52Z I! [inputs.logfile] turned on logs plugin
2023-11-01T23:02:53Z I! [inputs.logfile] Reading from offset 103961 in /var/log/messages
2023-11-01T23:02:53Z E! [inputs.logfile] Failed to tail file /var/log/messages with error: open /var/log/messages: permission denied
2023-11-01T23:02:53Z I! [inputs.logfile] Reading from offset 1604 in /var/log/secure
2023-11-01T23:02:53Z E! [inputs.logfile] Failed to tail file /var/log/secure with error: open /var/log/secure: permission denied
2023-11-01T23:02:53Z I! [inputs.logfile] Reading from offset 27293 in /var/log/dmesg
2023-11-01T23:02:53Z I! [inputs.logfile] Reading from offset 892 in /var/log/yum.log
2023-11-01T23:02:53Z E! [inputs.logfile] Failed to tail file /var/log/yum.log with error: open /var/log/yum.log: permission denied

If I restart the service sometimes it will be able to open the files, sometimes not. I tried following the process with strace, but it only shows a simple permission denied error on the log files in question (as reflected by the error log above)

What version did you use?
amazon-cloudwatch-agent RPM package 1.300028.1-1.amzn2 although it's been happening for a while

What config did you use?
See above in user data and log output

Environment
Amazon Linux 2

Additional context
We're controlling it's access using an acl (as seen in the user data script), adding cwagent user to the adm group, and giving the adm group read access with the setfacl commands.

amazon-cloudwatch-agent RPM package 1.300028.1-1.amzn2 although it's been happening for a while

Do you know when it started happening?

We're controlling it's access using an acl (as seen in the user data script), adding cwagent user to the adm group, and giving the adm group read access with the setfacl commands.

It's strange that it's sporadic behavior. Is it possible that the files within the /var/log directory are created after the ACL is set? Would they respect the ACL set on the directory?

amazon-cloudwatch-agent RPM package 1.300028.1-1.amzn2 although it's been happening for a while

Do you know when it started happening?

I think sometime in July, but I can't be sure, there's been a bit of a habit of "restart the server and it will be fine" and I'm only really looking into it now.

We're controlling it's access using an acl (as seen in the user data script), adding cwagent user to the adm group, and giving the adm group read access with the setfacl commands.

It's strange that it's sporadic behavior. Is it possible that the files within the /var/log directory are created after the ACL is set? Would they respect the ACL set on the directory?

It is possible they are created after the ACL was set, but the acl definitely applies to the files currently there:

$ getfacl messages
# file: messages
# owner: root
# group: root
user::rw-
group::---
group:adm:r--
mask::r--
other::---

If I restart the agent a few times, sometimes it will work, and sometimes it wont, but in all cases the /var/log/messages file is the same file (confirmed by checking the inode number of the file is unchanging)

If possible can you try the following cases so we can gather more insight about the issue:

  1. Can you remove this line from your configuration and start agent with root access. This will let us understand if the issue is
    related cwagent user configuration or agent related.
    "run_as_user": "cwagent"
    
  2. Can you create a new amazon-linux 2 instance and run cloudwatch-agent with default agent config(only keeping the log configuration from your config file)
    Thank you for your collaboration

This happened again this morning in our staging environment so I took the opportunity to try 1. With the runas cwagent removed it never has the permission denied error, and in the log I can see it's deciding to start as root (and a ps shows it running as root)

2023/11/21 10:14:07 I! Detected runAsUser: root

I'll try 2. in the next day or so

Ok, I tried to, my user data is identical to before, but with this cwagent config:

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/messages*",
            "log_group_name": "jfharden-experiment",
            "log_stream_name": "{hostname}/messages",
            "timestamp_format": "%b %-d %H:%M:%S"
          },
          {
            "file_path": "/var/log/secure*",
            "log_group_name": "jfharden-experiment",
            "log_stream_name": "{hostname}/secure",
            "timestamp_format": "%b %-d %H:%M:%S"
          },
          {
            "file_path": "/var/log/dmesg",
            "log_group_name": "jfharden-experiment",
            "log_stream_name": "{hostname}/dmesg"
          },
          {
            "file_path": "/var/log/yum.log*",
            "log_group_name": "jfharden-experiment",
            "log_stream_name": "{hostname}/yum",
            "timestamp_format": "%b %-d %H:%M:%S"
          },
          {
            "file_path": "/var/log/audit/audit.log*",
            "log_group_name": "jfharden-experiment",
            "log_stream_name": "{hostname}/audit"
          }
        ]
      }
    }
  }
}

With this config it works every time, although give that it's running as root I am not surprised. It does make me think it's more likely to be an amazon linux problem, rather than a cloudwatch agent problem. I think i will try upgrading to amazon linux 2023 next, and will report back on my findings

Going down the amazon linux 2023 path is painful due to #382 but I will still persist and try to replicate in my dev env, even if I don't end up shipping to production

Having tired amazon linux 2023 the errors remain when using the cwagent user and a file acl.

I'm back to thinking it's to do with the log agent now.

I just modified my user data to completely ditch the file acls and just add cwagent to the adm group, and make the logs have group adm with group read (I put the effort into making sure log rotate was ok too) and the permission denied errors using just basic group permissions still generates the same error

My user data now instead of the facl comamnds has

# Add create files as group adm directive to logrotate config if it's not there
if ! grep "create" /etc/logrotate.d/syslog; then
  sed -i -E 's/\{/{\n    create 640 root adm/' /etc/logrotate.d/syslog
fi
if ! grep "create 0640 root adm" /etc/logrotate.d/yum; then
  sed -i -E 's/create 0600 root root/create 640 root adm/' /etc/logrotate.d/yum
fi

# Add CW agent to adm group and change some logs to have adm group
usermod -a -G adm cwagent

# The audit logs already have adm group
chgrp adm \
  /var/log/messages \
  /var/log/secure \
  /var/log/dmesg \
  /var/log/yum.log

# Add group read access (audit logs already have it)
chmod g+r \
  /var/log/messages \
  /var/log/secure \
  /var/log/dmesg \
  /var/log/yum.log

# Use the existing systemd dmesg startup and append mod and group changes to the newly created file
cp /usr/lib/systemd/rhel-dmesg /etc/systemd/rhel-dmesg
cat >>/etc/systemd/rhel-dmesg <<EOF
chmod g+r /var/log/dmesg
chgrp adm /var/log/dmesg
EOF

# Override the systemd service config for rhel-dmesg to execute our script created above
cp /usr/lib/systemd/system/rhel-dmesg.service /etc/systemd/system/rhel-dmesg.service
sed -i -E 's@^ExecStart.*$@ExecStart=/etc/systemd/rhel-dmesg@' /etc/systemd/system/rhel-dmesg.service

The permission in /var/log

# ls -la /var/log/
total 468
drwxr-xr-x   7 root   root              4096 Nov 23 16:37 .
drwxr-xr-x  19 root   root               269 Nov 23 16:36 ..
drwxr-xr-x   3 root   root                48 Nov 23 16:37 amazon
drwxr-x---   2 root   adm                 23 Nov 23 16:36 audit
-rw-------   1 root   root             17054 Nov 23 16:37 boot.log
-rw-------   1 root   utmp                 0 Nov 16 23:33 btmp
drwxr-x---   2 chrony chrony              72 Nov 23 16:36 chrony
-rw-r--r--   1 root   root            183022 Nov 23 16:37 cloud-init.log
-rw-r-----   1 root   root              7945 Nov 23 16:37 cloud-init-output.log
-rw-------   1 root   root               541 Nov 23 16:40 cron
-rw-r--r--   1 root   adm              27271 Nov 23 16:37 dmesg
-rw-r--r--   1 root   adm              27305 Nov 23 16:36 dmesg.old
-rw-r--r--   1 root   root               193 Nov 16 23:32 grubby_prune_debug
drwxr-sr-x+  3 root   systemd-journal     46 Nov 23 16:36 journal
-rw-r--r--   1 root   root            292584 Nov 23 16:44 lastlog
-rw-------   1 root   root               582 Nov 23 16:37 maillog
-rw-r-----   1 root   adm             153927 Nov 23 16:44 messages
drwxr-xr-x   2 root   root                18 Nov 23 16:36 sa
-rw-r-----   1 root   adm               2836 Nov 23 16:44 secure
-rw-------   1 root   root                 0 Nov 16 23:33 spooler
-rw-------   1 root   root                 0 Nov 16 23:32 tallylog
-rw-rw-r--   1 root   utmp              6144 Nov 23 16:42 wtmp
-rw-r-----   1 root   adm                 77 Nov 23 16:37 yum.log

And in /var/log/audit

# ls -la /var/log/audit/
total 2116
drwxr-x--- 2 root adm       23 Nov 23 16:36 .
drwxr-xr-x 7 root root    4096 Nov 23 16:37 ..
-rw-r----- 1 root adm  1332992 Nov 23 16:44 audit.log

WIth the same cwagent config (with runas cwagent) and the symptoms are exactly the same. Sometimes restarting the daemon it receives permission denied

2023/11/23 16:39:42 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2023/11/23 16:39:42 D! config [agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = ""
  interval = "60s"
  logfile = "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log"
  logtarget = "lumberjack"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = false

[inputs]

  [[inputs.logfile]]
    destination = "cloudwatchlogs"
    file_state_folder = "/opt/aws/amazon-cloudwatch-agent/logs/state"

    [[inputs.logfile.file_config]]
      file_path = "/var/log/messages*"
      from_beginning = true
      log_group_name = "jfharden-experiment__bastion"
      log_stream_name = "ip-10-0-1-120.eu-west-1.compute.internal/messages"
      pipe = false
      retention_in_days = -1
      timestamp_layout = "Jan 2 15:04:05"
      timestamp_regex = "(\\w{3} \\s{0,1}\\d{1,2} \\d{2}:\\d{2}:\\d{2})"

    [[inputs.logfile.file_config]]
      file_path = "/var/log/secure*"
      from_beginning = true
      log_group_name = "jfharden-experiment__bastion"
      log_stream_name = "ip-10-0-1-120.eu-west-1.compute.internal/secure"
      pipe = false
      retention_in_days = -1
      timestamp_layout = "Jan 2 15:04:05"
      timestamp_regex = "(\\w{3} \\s{0,1}\\d{1,2} \\d{2}:\\d{2}:\\d{2})"

    [[inputs.logfile.file_config]]
      file_path = "/var/log/dmesg"
      from_beginning = true
      log_group_name = "jfharden-experiment__bastion"
      log_stream_name = "ip-10-0-1-120.eu-west-1.compute.internal/dmesg"
      pipe = false
      retention_in_days = -1

    [[inputs.logfile.file_config]]
      file_path = "/var/log/yum.log*"
      from_beginning = true
      log_group_name = "jfharden-experiment__bastion"
      log_stream_name = "ip-10-0-1-120.eu-west-1.compute.internal/yum"
      pipe = false
      retention_in_days = -1
      timestamp_layout = "Jan 2 15:04:05"
      timestamp_regex = "(\\w{3} \\s{0,1}\\d{1,2} \\d{2}:\\d{2}:\\d{2})"

    [[inputs.logfile.file_config]]
      file_path = "/var/log/audit/audit.log*"
      from_beginning = true
      log_group_name = "jfharden-experiment__bastion"
      log_stream_name = "ip-10-0-1-120.eu-west-1.compute.internal/audit"
      pipe = false
      retention_in_days = -1

[outputs]

  [[outputs.cloudwatchlogs]]
    force_flush_interval = "5s"
    log_stream_name = "i-0cc4e78ba7c61bde5"
    region = "eu-west-1"
2023/11/23 16:39:42 I! Config has been translated into YAML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.yaml
2023/11/23 16:39:42 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
2023/11/23 16:39:42 I! Valid Json input schema.
2023/11/23 16:39:42 I! Detected runAsUser: cwagent
2023/11/23 16:39:42 I! Changing ownership of [/opt/aws/amazon-cloudwatch-agent/logs /opt/aws/amazon-cloudwatch-agent/etc /opt/aws/amazon-cloudwatch-agent/var] to 995:993
2023/11/23 16:39:42 I! Set HOME: /home/cwagent
2023-11-23T16:39:42Z I! Starting AmazonCloudWatchAgent CWAgent/1.300028.1 (go1.20.7; linux; amd64)
2023-11-23T16:39:42Z I! AWS SDK log level not set
2023-11-23T16:39:42Z I! creating new logs agent
2023-11-23T16:39:42Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-0-1-120.eu-west-1.compute.internal", Flush Interval:1s
2023-11-23T16:39:42Z I! [logagent] start logs plugin file paths [/var/log/messages* /var/log/secure* /var/log/dmesg /var/log/yum.log* /var/log/audit/audit.log*]
2023-11-23T16:39:42Z I! [inputs.logfile] turned on logs plugin
2023-11-23T16:39:42Z I! [logagent] starting
2023-11-23T16:39:42Z I! [logagent] found plugin cloudwatchlogs is a log backend
2023-11-23T16:39:42Z I! [logagent] found plugin logfile is a log collection
2023-11-23T16:39:42Z I! [logagent] start logs plugin file paths [/var/log/messages* /var/log/secure* /var/log/dmesg /var/log/yum.log* /var/log/audit/audit.log*]
2023-11-23T16:39:42Z I! [inputs.logfile] turned on logs plugin
2023-11-23T16:39:43Z I! [inputs.logfile] Reading from offset 151834 in /var/log/messages
2023-11-23T16:39:43Z E! [inputs.logfile] Failed to tail file /var/log/messages with error: open /var/log/messages: permission denied
2023-11-23T16:39:43Z I! [inputs.logfile] Reading from offset 2364 in /var/log/secure
2023-11-23T16:39:43Z E! [inputs.logfile] Failed to tail file /var/log/secure with error: open /var/log/secure: permission denied
2023-11-23T16:39:43Z I! [inputs.logfile] Reading from offset 27271 in /var/log/dmesg
2023-11-23T16:39:43Z I! [inputs.logfile] Reading from offset 77 in /var/log/yum.log
2023-11-23T16:39:43Z E! [inputs.logfile] Failed to tail file /var/log/yum.log with error: open /var/log/yum.log: permission denied
2023-11-23T16:39:43Z I! [logagent] piping log from jfharden-experiment__bastion/ip-10-0-1-120.eu-west-1.compute.internal/dmesg(/var/log/dmesg) to cloudwatchlogs with retention -1
2023-11-23T16:39:44Z I! [inputs.logfile] Reading from offset 151834 in /var/log/messages
2023-11-23T16:39:44Z E! [inputs.logfile] Failed to tail file /var/log/messages with error: open /var/log/messages: permission denied
2023-11-23T16:39:44Z I! [inputs.logfile] Reading from offset 2364 in /var/log/secure
2023-11-23T16:39:44Z E! [inputs.logfile] Failed to tail file /var/log/secure with error: open /var/log/secure: permission denied
2023-11-23T16:39:44Z I! [inputs.logfile] Reading from offset 77 in /var/log/yum.log
2023-11-23T16:39:44Z E! [inputs.logfile] Failed to tail file /var/log/yum.log with error: open /var/log/yum.log: permission denied
2023-11-23T16:39:45Z I! [inputs.logfile] Reading from offset 151834 in /var/log/messages
2023-11-23T16:39:45Z E! [inputs.logfile] Failed to tail file /var/log/messages with error: open /var/log/messages: permission denied
2023-11-23T16:39:45Z I! [inputs.logfile] Reading from offset 2364 in /var/log/secure
2023-11-23T16:39:45Z E! [inputs.logfile] Failed to tail file /var/log/secure with error: open /var/log/secure: permission denied
2023-11-23T16:39:45Z I! [inputs.logfile] Reading from offset 77 in /var/log/yum.log
2023-11-23T16:39:45Z E! [inputs.logfile] Failed to tail file /var/log/yum.log with error: open /var/log/yum.log: permission denied
2023-11-23T16:39:46Z I! [inputs.logfile] Reading from offset 151834 in /var/log/messages
2023-11-23T16:39:46Z E! [inputs.logfile] Failed to tail file /var/log/messages with error: open /var/log/messages: permission denied
2023-11-23T16:39:46Z I! [inputs.logfile] Reading from offset 2364 in /var/log/secure
2023-11-23T16:39:46Z E! [inputs.logfile] Failed to tail file /var/log/secure with error: open /var/log/secure: permission denied
2023-11-23T16:39:46Z I! [inputs.logfile] Reading from offset 77 in /var/log/yum.log
2023-11-23T16:39:46Z E! [inputs.logfile] Failed to tail file /var/log/yum.log with error: open /var/log/yum.log: permission denied
2023-11-23T16:39:47Z I! [inputs.logfile] Reading from offset 151834 in /var/log/messages

and so on forever or until the next restart

I think it's especially interesting that one of the log files was actually opened successfully (/var/log/dmesg) and all the others receive permission denied errors (the logs continue to

For the moment I'm going to remediate the issue by making the server "self-healing" and add the following to our userdata:

mkdir -p /opt/govukpay/
cat >> /opt/govukpay/remediate-amazon-cloudwatch-log-permission-denied-errors.sh <<EOF
#!/bin/bash

if tail /var/log/amazon/amazon-cloudwatch-agent/amazon-cloudwatch-agent.log | grep "permission denied"; then
  logger "Amazon CloudWatch Agent has permission denied on logs. Restarting amazon-cloudwatch-agent service"

  if systemctl restart amazon-cloudwatch-agent; then
    logger "amazon-cloudwatch-agent succesfully restarted"
  else
    logger "Failed to restart amazon-cloudwatch-agent"
  fi
fi
EOF
chmod 500 /opt/govukpay/remediate-amazon-cloudwatch-log-permission-denied-errors.sh

cat >> /etc/cron.d/remediate-amazon-cloudwatch-log-permission-denied-errors <<EOF
# Run the hourly jobs
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
*/5 * * * * root /opt/govukpay/remediate-amazon-cloudwatch-log-permission-denied-errors.sh
EOF

Hi,
Thank you for all the information you have provided It is expected for agent to run as root if a specific user is not assigned in the configuration. It seems like the problem is related to permissions assigned to cwagent user rather than the agent itself.This seems like a customer environment setup issue rather than the agent issue, requiring no action from agent team. Is there anything else I can help you with this issue? If not I will be closing this issue.

For what it is worth, this is happening to me to.

I had configuration which was running fine for perhaps years. In the Summer I was making upgrades, e.g. flip to Ubuntu:22.04, I probably also absorbed the latest AMI.

The service I was running would periodically get stuck (I don't think related to AWS), so I used CloudWatch to look for a log message indicating it was sick. I'd then use a Lambda to kill the sick server and have it re-start.

But then I found that periodically the server would stop emitting log messages, meaning I didn't know it was stuck. I didn't root cause this issue properly, and instead created another alarm to check I was receiving log messages; and if I was not, I would kill the server and re-start it.

I've since upgraded another ECS task to Ubuntu 22.04 and found the same problem is happening with that service too.

If I log in to the server, I find cwagent struggling to read log files.

2024-04-17T01:02:42Z E! [inputs.logfile] Failed to tail file /var/log/syslog with error: open /var/log/syslog: permission denied
2024-04-17T01:02:42Z E! [inputs.logfile] Failed to tail file /var/log/mailman/smtp.log with error: open /var/log/mailman/smtp.log: permission denied
2024-04-17T01:02:42Z E! [inputs.logfile] Failed to tail file /var/log/mailman/bounce.log with error: open /var/log/mailman/bounce.log: permission denied
2024-04-17T01:02:42Z E! [inputs.logfile] Failed to tail file /var/log/mailman/mailman.log with error: open /var/log/mailman/mailman.log: permission denied
2024-04-17T01:02:42Z E! [inputs.logfile] Failed to tail file /var/log/mailman/http.log with error: open /var/log/mailman/http.log: permission denied
2024-04-17T01:02:42Z E! [inputs.logfile] Failed to tail file /var/log/uwsgi-backend.log with error: open /var/log/uwsgi-backend.log: permission denied
2024-04-17T01:02:42Z E! [inputs.logfile] Failed to tail file /var/log/uwsgi.log with error: open /var/log/uwsgi.log: permission denied

But sudo -u cwagent has no problem at all tailing the logs:

root@405f80ef7893:/var/log/amazon/amazon-cloudwatch-agent# sudo -u cwagent tail /var/log/syslog
Apr 16 20:17:01 405f80ef7893 CRON[879]: (root) CMD ([880]    cd / && run-parts --report /etc/cron.hourly)
Apr 16 20:17:01 405f80ef7893 CRON[879]: (root) END ([880]    cd / && run-parts --report /etc/cron.hourly)
Apr 16 21:17:01 405f80ef7893 CRON[899]: (root) CMD ([900]    cd / && run-parts --report /etc/cron.hourly)
Apr 16 21:17:01 405f80ef7893 CRON[899]: (root) END ([900]    cd / && run-parts --report /etc/cron.hourly)
Apr 16 22:17:01 405f80ef7893 CRON[918]: (root) CMD ([919]    cd / && run-parts --report /etc/cron.hourly)
Apr 16 22:17:01 405f80ef7893 CRON[918]: (root) END ([919]    cd / && run-parts --report /etc/cron.hourly)
Apr 16 23:17:01 405f80ef7893 CRON[937]: (root) CMD ([938]    cd / && run-parts --report /etc/cron.hourly)
Apr 16 23:17:01 405f80ef7893 CRON[937]: (root) END ([938]    cd / && run-parts --report /etc/cron.hourly)
Apr 17 00:17:01 405f80ef7893 CRON[958]: (root) CMD ([959]    cd / && run-parts --report /etc/cron.hourly)
Apr 17 00:17:01 405f80ef7893 CRON[958]: (root) END ([959]    cd / && run-parts --report /etc/cron.hourly)

Killing the agent the first time didn't fix the issue:

2024/04/17 01:04:44 I! Config has been translated into YAML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.yaml
2024/04/17 01:04:44 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
2024/04/17 01:04:44 I! Valid Json input schema.
2024/04/17 01:04:44 I! Detected runAsUser: cwagent
2024/04/17 01:04:44 I! Changing ownership of [/opt/aws/amazon-cloudwatch-agent/logs /opt/aws/amazon-cloudwatch-agent/etc /opt/aws/amazon-cloudwatch-agent/var] to 999:999
2024/04/17 01:04:44 I! Set HOME: /home/cwagent
2024-04-17T01:04:44Z I! Starting AmazonCloudWatchAgent CWAgent/1.300035.0b547 (go1.22.1; linux; amd64) with log file /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log with log tar>
2024-04-17T01:04:44Z I! AWS SDK log level not set
2024-04-17T01:04:44Z I! creating new logs agent
2024-04-17T01:04:44Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"405f80ef7893", Flush Interval:1s
2024-04-17T01:04:44Z I! [logagent] start logs plugin file paths [/var/log/syslog /var/log/mailman/smtp.log /var/log/mailman/bounce.log /var/log/mailman/mailman.log /var/log/mailman/http.log /var/>
2024-04-17T01:04:44Z I! [logagent] starting
2024-04-17T01:04:44Z I! [logagent] found plugin cloudwatchlogs is a log backend
2024-04-17T01:04:44Z I! [logagent] found plugin logfile is a log collection
2024-04-17T01:04:44Z I! [logagent] start logs plugin file paths [/var/log/syslog /var/log/mailman/smtp.log /var/log/mailman/bounce.log /var/log/mailman/mailman.log /var/log/mailman/http.log /var/>
2024-04-17T01:04:44Z I! [inputs.logfile] turned on logs plugin
2024-04-17T01:04:44Z I! [inputs.logfile] turned on logs plugin
2024-04-17T01:04:45Z E! [inputs.logfile] Failed to tail file /var/log/syslog with error: open /var/log/syslog: permission denied
2024-04-17T01:04:45Z E! [inputs.logfile] Failed to tail file /var/log/mailman/smtp.log with error: open /var/log/mailman/smtp.log: permission denied

Killing it a second time did

2024/04/17 01:07:02 I! Config has been translated into YAML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.yaml
2024/04/17 01:07:02 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
2024/04/17 01:07:02 I! Valid Json input schema.
2024/04/17 01:07:02 I! Detected runAsUser: cwagent
2024/04/17 01:07:02 I! Changing ownership of [/opt/aws/amazon-cloudwatch-agent/logs /opt/aws/amazon-cloudwatch-agent/etc /opt/aws/amazon-cloudwatch-agent/var] to 999:999
2024/04/17 01:07:02 I! Set HOME: /home/cwagent
2024-04-17T01:07:02Z I! Starting AmazonCloudWatchAgent CWAgent/1.300035.0b547 (go1.22.1; linux; amd64) with log file /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log with log tar>
2024-04-17T01:07:02Z I! AWS SDK log level not set
2024-04-17T01:07:02Z I! creating new logs agent
2024-04-17T01:07:02Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"405f80ef7893", Flush Interval:1s
2024-04-17T01:07:02Z I! [logagent] start logs plugin file paths [/var/log/syslog /var/log/mailman/smtp.log /var/log/mailman/bounce.log /var/log/mailman/mailman.log /var/log/mailman/http.log /var/>
2024-04-17T01:07:02Z I! [inputs.logfile] turned on logs plugin
2024-04-17T01:07:02Z I! [logagent] starting
2024-04-17T01:07:02Z I! [logagent] found plugin cloudwatchlogs is a log backend
2024-04-17T01:07:02Z I! [logagent] found plugin logfile is a log collection
2024-04-17T01:07:02Z I! [logagent] start logs plugin file paths [/var/log/syslog /var/log/mailman/smtp.log /var/log/mailman/bounce.log /var/log/mailman/mailman.log /var/log/mailman/http.log /var/>
2024-04-17T01:07:02Z I! [inputs.logfile] turned on logs plugin
2024-04-17T01:07:03Z I! [inputs.logfile] Reading from offset 1120 in /var/log/rescuequeue.log
2024-04-17T01:07:03Z I! [outputs.cloudwatchlogs] Configured middleware on AWS client
2024-04-17T01:07:03Z I! [logagent] piping log from /ecs/mailman/172.31.2.167_i-031718c3e8271da75_syslog(/var/log/syslog) to cloudwatchlogs with retention -1
2024-04-17T01:07:03Z I! [outputs.cloudwatchlogs] Configured middleware on AWS client
2024-04-17T01:07:03Z I! [logagent] piping log from /ecs/mailman/172.31.2.167_i-031718c3e8271da75_smtp(/var/log/mailman/smtp.log) to cloudwatchlogs with retention -1
2024-04-17T01:07:03Z I! [outputs.cloudwatchlogs] Configured middleware on AWS client

Between these attempts I made no changes to the permissions. I conclude there must be a software fault in the cloud watch logs agent and recommend further investigation.

In the mean time I'll amend my cwagent to run as root, which is not really a preferred configuration, and I propose this issue should be re-opened.

FYI I happen to work for Amazon, but this project is not related to my Amazon employment. But if AWS tech staff want to contact me, you'll find me in the corporate directory.

I don't seem to be able to re-open the issue, so have created #1140 and referenced this issue.