influxdata/telegraf

upgrade process was interrupted - "Failed to set capabilities on file" errors

WS-Dave opened this issue · 7 comments

Relevant telegraf.conf

telegraf.conf doesn't affect the problem.
(It shows the same behavior if I install from scratch without my config.)

System info

Telegraf 1.19.2 vs newer versions, Synology DS1515+ DSM 6.2-25556, Docker 20.10.3-0554

Docker

sudo docker run -itd --name="telegraf" -p 8092:8092/udp -p 8094:8094/tcp -p 8125:8125/udp --restart=always -v /volume1/docker/telegraf/mibs:/usr/share/snmp/mibs -v /volume1/docker/telegraf/configuration:/etc/telegraf -v /volume1/docker/telegraf/logs:/var/log/telegraf -e "TZ=America/Los_Angeles" telegraf:latest
(causes the errors)

vs

sudo docker run -itd --name="telegraf" -p 8092:8092/udp -p 8094:8094/tcp -p 8125:8125/udp --restart=always -v /volume1/docker/telegraf/mibs:/usr/share/snmp/mibs -v /volume1/docker/telegraf/configuration:/etc/telegraf -v /volume1/docker/telegraf/logs:/var/log/telegraf -e "TZ=America/Los_Angeles" telegraf:1.19.2
(or earlier versions) works fine

Steps to reproduce

Running any version newer than 1.19.2 causes restart loops and errors (see errors below).

Expected behavior

Telegraf container starts up and stays up/stable

Actual behavior

The container goes into a restart loop and the logs show these errors:

  • Failed to set capabilities on file `/usr/bin/telegraf' (Operation not supported)
  • The value of the capability argument is not permitted for a file. Or the file is not a regular (non-symlink) file

The errors are re-posted on every restart loop.

Additional info

I use Watchtower to install the latest version of my Docker packages.

Within the last 24 hours, Watchtower updated my telegraf:latest, and then the service would not start successfully. It appears that something happened to corrupt the install.

I tried it with my current telegraf config as well as with a fresh install (with a clean/empty config);
the errors are the same.

I found that if I delete the container and use telegraf:1.19.2 or earlier, everything works fine.

But I would obviously prefer to be back on telegraf:latest so I get the updated versions as they are released.

Please advise,
Thx!

Hi,

Is watchtower doing any customization to the telegraf image?
Are you using a custom Dockerfile?

When I use the following config:

[[inputs.cpu]]
[[outputs.file]]

It starts right up:

❯ docker run -itd --name="telegraf" --restart=always -v /home/powersj/telegraf/config.toml:/etc/telegraf/telegraf.conf -e "TZ=America/Los_Angeles" telegraf:latest
Unable to find image 'telegraf:latest' locally
latest: Pulling from library/telegraf
c4cc477c22ba: Pull complete 
077c54d048f1: Pull complete 
0368544993b2: Pull complete 
184b981f0900: Pull complete 
818f541bf2f4: Pull complete 
ec8a09357eee: Pull complete 
5459e6ed770e: Pull complete 
Digest: sha256:ac3c525fa5b234d06a60107d3ba9f7307e2aed77f9f6473ea646b88bcd36145b
Status: Downloaded newer image for telegraf:latest
73b8062a86599d29e5f7192625263bb6a6e1e02c303f69e61c1ce9473561bd2d
❯ docker logs telegraf
2021-12-18T00:06:15Z I! Starting Telegraf 1.21.1
2021-12-18T00:06:15Z I! Using config file: /etc/telegraf/telegraf.conf
2021-12-18T00:06:15Z I! Loaded inputs: cpu
2021-12-18T00:06:15Z I! Loaded aggregators: 
2021-12-18T00:06:15Z I! Loaded processors: 
2021-12-18T00:06:15Z I! Loaded outputs: file
2021-12-18T00:06:15Z I! Tags enabled: host=73b8062a8659
2021-12-18T00:06:15Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"73b8062a8659", Flush Interval:10s
cpu,cpu=cpu0,host=73b8062a8659 usage_softirq=0,usage_guest=0,usage_idle=99.00000000000006,usage_system=0.7000000000000026,usage_nice=0,usage_iowait=0,usage_irq=0,usage_steal=0,usage_guest_nice=0,usage_user=0.3000000000000011 1639785990000000000
cpu,cpu=cpu1,host=73b8062a8659 usage_system=0.10020040080160311,usage_idle=99.69939879759518,usage_softirq=0,usage_guest_nice=0,usage_guest=0,usage_user=0.20040080160320622,usage_nice=0,usage_iowait=0,usage_irq=0,usage_steal=0 1639785990000000000
cpu,cpu=cpu2,host=73b8062a8659 usage_user=0.8008008008008031,usage_irq=0,usage_steal=0,usage_iowait=0,usage_softirq=0,usage_guest=0,usage_guest_nice=0,usage_system=0.10010010010010038,usage_idle=99.09909909909906,usage_nice=0 1639785990000000000
cpu,cpu=cpu3,host=73b8062a8659 usage_irq=0,usage_softirq=0,usage_user=0.10000000000000009,usage_nice=0,usage_iowait=0,usage_guest=0,usage_guest_nice=0,usage_system=0.20000000000000018,usage_idle=99.69999999999999,usage_steal=0 1639785990000000000
cpu,cpu=cpu4,host=73b8062a8659 usage_guest=0,usage_guest_nice=0,usage_user=0,usage_system=0,usage_irq=0,usage_softirq=0,usage_idle=100,usage_nice=0,usage_iowait=0,usage_steal=0 1639785990000000000

AFAIK, Watchtower is not. I'm just using its default config, so it's doing the most basic check/update.

The thing that is so puzzling is that it was working fine yesterday (with 1.21.0) as well as previous versions (all via Watchtower for probably a year) and then the error started.

It doesn't seem like an issue w/ 1.21.1 - because anything newer than 1.19.2 causes the same problem. I'm wondering if 1.19.3 added some new dependency that might be broken/missing on my setup (but I can't spot anything obvious in the change lists...)

This might get a bit confusing since we have a docker container image version and a Telegraf version.

> The thing that is so puzzling is that it was working fine yesterday (with 1.21.0) as well as previous versions (all via Watchtower for probably a year) and then the error started.

In terms of docker containers, we never published a v1.21 Telegraf container with version v1.21.0 of Telegraf. This was because we got reports of an issue with logging/parsing Wednesday night and held off on updating the docker image until v1.21.1 of Telegraf was released on Thursday.

> It doesn't seem like an issue w/ 1.21.1 - because anything newer than 1.19.2 causes the same problem. I'm wondering if 1.19.3 added some new dependency that might be broken/missing on my setup (but I can't spot anything obvious in the change lists...)

We publish docker images for the last three minor versions of Telegraf, but we do make changes to all three Dockerfiles as we do releases. Yesterday when we published v1.21, v1.20 and v1.19 of the Telegraf container images also got an update. In these updates, we made a change to use setcap to give the telegraf binary some extra capabilities as the binary no longer runs as root.

Your error message shows:

The value of the capability argument is not permitted for a file. Or the file is not a regular (non-symlink) file

Is there a way for you to see what /usr/bin/telegraf is, via something like stat /usr/bin/telegraf? The error makes me think something has changed it or it has become a symlink.
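
For anyone who wants to run that check, here is a minimal sketch in shell. The helper name file_kind is hypothetical (it is not part of Telegraf or Docker). It only reports whether a path is a symlink, a regular file, something else, or missing, which is the distinction the second error line hints at.

```shell
#!/bin/sh
# Hypothetical helper: classify a path as symlink, regular file, other, or missing.
# The symlink test comes first because [ -f ] follows symlinks.
file_kind() {
    if [ -L "$1" ]; then
        echo "symlink"
    elif [ -f "$1" ]; then
        echo "regular"
    elif [ -e "$1" ]; then
        echo "other"
    else
        echo "missing"
    fi
}
```

Since a container stuck in a restart loop is hard to exec into, one option is to bypass the entrypoint for a one-off check, for example `docker run --rm --entrypoint stat telegraf:latest /usr/bin/telegraf`, assuming stat is available in the image.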

Thanks!

@WS-Dave do you happen to be running the container on a Synology device (per the system info, it looks like it)? If so, can you take a look at influxdata/influxdata-docker#561 please?

@powersj Thank you for your follow up.

Indeed - I'm running
Synology DS1515+ DSM 6.2-25556
Docker 20.10.3-0554

and the setcap/aufs/capsh issues seem to be biting me.

Is there a workaround? I couldn't identify one directly from influxdata/influxdata-docker#561
Maybe I'm missing something in terms of syntax and settings.

Thx!

I am not sure what options Synology's Docker plugin gives you. If possible, you could try running the container as the 'telegraf' user, which avoids the attempt to set the capabilities, or use an older version of the container image.
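
Why running as the 'telegraf' user helps can be sketched as follows. This only illustrates the behavior described in this thread and is not the actual entrypoint script from the image. The function name entrypoint_plan and its output strings are made up for the example.

```shell
#!/bin/sh
# Sketch only: NOT the entrypoint shipped in the telegraf image.
# Given a numeric user ID, report what a privilege-dropping entrypoint would do.
entrypoint_plan() {
    uid="$1"
    if [ "$uid" -ne 0 ]; then
        # Already unprivileged: no capabilities to set, just run the command.
        echo "exec-directly"
    else
        # Root: set file capabilities on the binary, then drop to the telegraf user.
        echo "setcap-then-drop"
    fi
}
```

Under this model, starting the container as a non-root user never reaches the setcap step, so a filesystem that rejects file capabilities no longer blocks startup.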

I am going to close this issue and track two follow-ups in influxdata/influxdata-docker#561: updating the error message, and not failing when the capabilities cannot be set.
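
The "not failing" part could look roughly like this. It is a sketch under assumptions, not the change tracked in influxdata/influxdata-docker#561: run_optional is a hypothetical wrapper that runs a setup command and prints a warning instead of aborting when that command fails.

```shell
#!/bin/sh
# Hypothetical wrapper: run an optional setup step; warn on failure instead of aborting.
run_optional() {
    if "$@"; then
        return 0
    fi
    # Report the failed command on stderr, then continue as if it were skipped.
    echo "warning: '$*' failed; continuing without it" >&2
    return 0
}
```

An entrypoint could then call something like `run_optional setcap <capabilities> /usr/bin/telegraf`, where the capability list is whatever the image needs (it is not stated in this thread).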

Thanks!

huzzah!

Thank you for the follow up!

I added the user flag as follows
-u "telegraf"
and It Just Works (with 'latest')

sudo docker run -itd --name="telegraf" -p 8092:8092/udp -p 8094:8094/tcp -p 8125:8125/udp --restart=always -v /volume1/docker/telegraf/mibs:/usr/share/snmp/mibs -v /volume1/docker/telegraf/configuration:/etc/telegraf -v /volume1/docker/telegraf/logs:/var/log/telegraf -e "TZ=America/Los_Angeles" -u "telegraf" telegraf:latest

Hope this helps others - and Thanks Again!