upgrade process was interrupted - "Failed to set capabilities on file" errors
WS-Dave opened this issue · 7 comments
Relevant telegraf.conf
telegraf.conf doesn't affect the problem.
It shows the same behavior if I try to install from scratch without my config.
System info
Telegraf 1.19.2 vs newer versions, Synology DS1515+ DSM 6.2-25556, Docker 20.10.3-0554
Docker
sudo docker run -itd --name="telegraf" -p 8092:8092/udp -p 8094:8094/tcp -p 8125:8125/udp --restart=always -v /volume1/docker/telegraf/mibs:/usr/share/snmp/mibs -v /volume1/docker/telegraf/configuration:/etc/telegraf -v /volume1/docker/telegraf/logs:/var/log/telegraf -e "TZ=America/Los_Angeles" telegraf:latest
(causes the errors)
vs
sudo docker run -itd --name="telegraf" -p 8092:8092/udp -p 8094:8094/tcp -p 8125:8125/udp --restart=always -v /volume1/docker/telegraf/mibs:/usr/share/snmp/mibs -v /volume1/docker/telegraf/configuration:/etc/telegraf -v /volume1/docker/telegraf/logs:/var/log/telegraf -e "TZ=America/Los_Angeles" telegraf:1.19.2
(or earlier versions) works fine
Steps to reproduce
running any version newer than 1.19.2 causes restart loops and errors (see errors below)
Expected behavior
Telegraf container starts up and stays up/stable
Actual behavior
container goes into a restart loop and logs show these errors:
- Failed to set capabilities on file `/usr/bin/telegraf' (Operation not supported)
- The value of the capability argument is not permitted for a file. Or the file is not a regular (non-symlink) file
the errors get re-posted every restart loop
Additional info
I use Watchtower to install the latest version of my Docker packages.
Within the last 24 hours, Watchtower updated my telegraf:latest and then the service would not start successfully. It appears that something happened to corrupt the install.
I tried it with my current telegraf config as well as with a fresh install (with a clean/empty config); the errors are the same.
I found that if I (delete the container) and use telegraf:1.19.2 or earlier - everything works fine.
But I would obviously prefer to be back on telegraf:latest so I get the updated versions as they are released.
Please advise,
Thx!
Hi,
Is watchtower doing any customization to the telegraf image?
Are you using a custom Dockerfile?
When I use the following config:
[[inputs.cpu]]
[[outputs.file]]
It starts right up:
❯ docker run -itd --name="telegraf" --restart=always -v /home/powersj/telegraf/config.toml:/etc/telegraf/telegraf.conf -e "TZ=America/Los_Angeles" telegraf:latest
Unable to find image 'telegraf:latest' locally
latest: Pulling from library/telegraf
c4cc477c22ba: Pull complete
077c54d048f1: Pull complete
0368544993b2: Pull complete
184b981f0900: Pull complete
818f541bf2f4: Pull complete
ec8a09357eee: Pull complete
5459e6ed770e: Pull complete
Digest: sha256:ac3c525fa5b234d06a60107d3ba9f7307e2aed77f9f6473ea646b88bcd36145b
Status: Downloaded newer image for telegraf:latest
73b8062a86599d29e5f7192625263bb6a6e1e02c303f69e61c1ce9473561bd2d
❯ docker logs telegraf
2021-12-18T00:06:15Z I! Starting Telegraf 1.21.1
2021-12-18T00:06:15Z I! Using config file: /etc/telegraf/telegraf.conf
2021-12-18T00:06:15Z I! Loaded inputs: cpu
2021-12-18T00:06:15Z I! Loaded aggregators:
2021-12-18T00:06:15Z I! Loaded processors:
2021-12-18T00:06:15Z I! Loaded outputs: file
2021-12-18T00:06:15Z I! Tags enabled: host=73b8062a8659
2021-12-18T00:06:15Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"73b8062a8659", Flush Interval:10s
cpu,cpu=cpu0,host=73b8062a8659 usage_softirq=0,usage_guest=0,usage_idle=99.00000000000006,usage_system=0.7000000000000026,usage_nice=0,usage_iowait=0,usage_irq=0,usage_steal=0,usage_guest_nice=0,usage_user=0.3000000000000011 1639785990000000000
cpu,cpu=cpu1,host=73b8062a8659 usage_system=0.10020040080160311,usage_idle=99.69939879759518,usage_softirq=0,usage_guest_nice=0,usage_guest=0,usage_user=0.20040080160320622,usage_nice=0,usage_iowait=0,usage_irq=0,usage_steal=0 1639785990000000000
cpu,cpu=cpu2,host=73b8062a8659 usage_user=0.8008008008008031,usage_irq=0,usage_steal=0,usage_iowait=0,usage_softirq=0,usage_guest=0,usage_guest_nice=0,usage_system=0.10010010010010038,usage_idle=99.09909909909906,usage_nice=0 1639785990000000000
cpu,cpu=cpu3,host=73b8062a8659 usage_irq=0,usage_softirq=0,usage_user=0.10000000000000009,usage_nice=0,usage_iowait=0,usage_guest=0,usage_guest_nice=0,usage_system=0.20000000000000018,usage_idle=99.69999999999999,usage_steal=0 1639785990000000000
cpu,cpu=cpu4,host=73b8062a8659 usage_guest=0,usage_guest_nice=0,usage_user=0,usage_system=0,usage_irq=0,usage_softirq=0,usage_idle=100,usage_nice=0,usage_iowait=0,usage_steal=0 1639785990000000000
AFAIK - watchtower is not. I'm just using the default config on that so it's doing the most basic check/update thing.
The thing that is so puzzling is that it was working fine yesterday (with 1.21.0) as well as previous versions (all via Watchtower for probably a year) and then the error started.
It doesn't seem like an issue w/ 1.21.1 - because anything newer than 1.19.2 causes the same problem. I'm wondering if 1.19.3 added some new dependency that might be broken/missing on my setup (but I can't spot anything obvious in the change lists...)
This might get a bit confusing since we have a docker container image version and a Telegraf version.
> The thing that is so puzzling is that it was working fine yesterday (with 1.21.0) as well as previous versions (all via Watchtower for probably a year) and then the error started.
In terms of docker containers, we never published a v1.21 Telegraf container with version v1.21.0 of Telegraf. This was because we got reports of an issue with logging/parsing Wednesday night and held off on updating the docker image until v1.21.1 of Telegraf was released on Thursday.
> It doesn't seem like an issue w/ 1.21.1 - because anything newer than 1.19.2 causes the same problem. I'm wondering if 1.19.3 added some new dependency that might be broken/missing on my setup (but I can't spot anything obvious in the change lists...)
We publish docker images for the last three minor versions of Telegraf, but we do make changes to all three Dockerfiles as we do releases. Yesterday when we published v1.21, v1.20 and v1.19 of the Telegraf container images also got an update. In these updates, we made a change to use setcap to give the telegraf binary some extra capabilities as the binary no longer runs as root.
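As a rough illustration of the step described above, and of the "warn instead of failing" idea this thread ends up tracking, an entrypoint could attempt the setcap call and carry on if it fails. This is only a sketch: the function name `grant_caps_or_warn` is hypothetical and the capability names are assumptions, so it should not be read as the image's actual entrypoint.

```shell
#!/bin/sh
# Hypothetical helper, not the official entrypoint: try to grant file
# capabilities to a binary, and warn rather than abort if that fails
# (for example on a filesystem that cannot store file capabilities).
grant_caps_or_warn() {
    binary="$1"
    # The capability names below are an assumption chosen for illustration.
    if setcap cap_net_raw,cap_net_bind_service+ep "$binary" 2>/dev/null; then
        echo "capabilities set on $binary"
    else
        echo "warning: could not set capabilities on $binary, continuing" >&2
    fi
    # Always succeed so a failed setcap does not stop the container from starting.
    return 0
}
```

With a helper like this, a setcap failure would produce a single warning line instead of sending the container into a restart loop.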
Your error message shows:
The value of the capability argument is not permitted for a file. Or the file is not a regular (non-symlink) file
Is there a way for you to see what /usr/bin/telegraf is, via something like stat /usr/bin/telegraf? The error makes me think something has changed it or it has become a symlink.
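A small check along these lines could complement stat. It is a minimal sketch with a hypothetical function name (`file_kind`); it only distinguishes the cases the error message mentions, namely a regular file versus a symlink.

```shell
#!/bin/sh
# Hypothetical helper: report whether a path is a symlink, a regular file,
# some other kind of entry, or missing. The symlink test comes first because
# [ -f ] follows symlinks and would otherwise report a link as regular.
file_kind() {
    if [ -L "$1" ]; then
        echo "symlink"
    elif [ -f "$1" ]; then
        echo "regular"
    elif [ -e "$1" ]; then
        echo "other"
    else
        echo "missing"
    fi
}
```

Running `file_kind /usr/bin/telegraf` in a shell inside the image would print one of the four words above. Anything other than `regular` would point toward the second error message.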
Thanks!
@WS-Dave do you happen to be running the container on a synology device (per system info it looks like it)? If so can you take a look at influxdata/influxdata-docker#561 please?
@powersj Thank you for your follow up.
Indeed - I'm running
Synology DS1515+ DSM 6.2-25556
Docker 20.10.3-0554
and the setcap/aufs/capsh issues seem to be biting me.
Is there a workaround? I couldn't identify one directly from influxdata/influxdata-docker#561
Maybe I'm missing something in terms of syntax and settings.
Thx!
I am not sure what options Synology's Docker plugin gives you. If possible, you could try running the container as the 'telegraf' user, which avoids the attempt to set the capabilities, or use an older version of the container image.
I am going to close this issue and track the follow-up work (updating the error message and not failing) in influxdata/influxdata-docker#561.
Thanks!
huzzah!
Thank you for the follow up!
I added the user flag as follows
-u "telegraf"
and It Just Works (with 'latest')
sudo docker run -itd --name="telegraf" -p 8092:8092/udp -p 8094:8094/tcp -p 8125:8125/udp --restart=always -v /volume1/docker/telegraf/mibs:/usr/share/snmp/mibs -v /volume1/docker/telegraf/configuration:/etc/telegraf -v /volume1/docker/telegraf/logs:/var/log/telegraf -e "TZ=America/Los_Angeles" -u "telegraf" telegraf:latest
Hope this helps others - and Thanks Again!
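For anyone applying the same workaround, one way to confirm that the `-u` flag took effect is to ask Docker which user the container is configured to run as. The sketch below assumes the container is named `telegraf` as in the commands above; the function name `container_user` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the user a container is configured to run as.
# An empty result generally means no user was set via -u or in the image.
container_user() {
    docker inspect --format '{{.Config.User}}' "$1"
}
```

Calling `container_user telegraf` after starting the container with `-u "telegraf"` should print the configured user name.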