Telegraf: inputs.docker can no longer access Docker socket due to recent entrypoint.sh changes
moorglade opened this issue · 16 comments
Summary
Recent changes in Telegraf's entrypoint.sh make it impossible for the docker input to access the Docker socket mounted in the Telegraf container.
Details
To use the docker input, one must mount the Docker socket to the Telegraf container, as described in the documentation. The documentation also states that:
Additionally docker telegraf user must be assigned to docker group id from host.
This used to work fine in previous versions, but recent changes in Telegraf's entrypoint.sh cause the container to drop all user groups (i.e. --regid telegraf --groups telegraf).
This causes the following error:
[inputs.docker] Error in plugin: permission denied while trying to connect to the Docker daemon socket at unix:///host/var/run/docker.sock: Get "http://%2Fhost%2Fvar%2Frun%2Fdocker.sock/v1.24/info": dial unix /host/var/run/docker.sock: connect: permission denied
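A quick way to confirm this diagnosis is to compare the socket's owning group id with the group ids the process actually holds. The following is a minimal sketch: the helper name has_gid is hypothetical, and the commented usage assumes GNU stat and the /host/var/run/docker.sock mount path from the error above.

```shell
# has_gid GID "LIST": succeed if GID appears in the space-separated LIST.
# (has_gid is a hypothetical helper name, not part of Telegraf or Docker.)
has_gid() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Intended use inside the container (assumes GNU stat and this mount path):
#   sock_gid="$(stat -c '%g' /host/var/run/docker.sock)"
#   if has_gid "$sock_gid" "$(id -G)"; then
#       echo "process holds the socket's group"
#   else
#       echo "process lacks gid $sock_gid, expect permission denied"
#   fi
```

Padding both sides with spaces keeps a gid such as 96 from matching inside 961.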
Related issues
Hi,
Can you provide some more details about how you are trying to do this, such that it is no longer working?
Looking at the previous issue, I used my previous suggestion of --user telegraf:$(stat -c '%g' /var/run/docker.sock) and it appears to work as expected:
The group I specified is still part of the telegraf user:
telegraf@4027e6d38705:/$ groups
groups: cannot find name for group ID 961
961
And collects stats as expected:
docker run --user telegraf:$(stat -c '%g' /var/run/docker.sock) -v /var/run/docker.sock:/var/run/docker.sock -v $PWD/config.toml:/etc/telegraf/telegraf.conf telegraf:latest
2024-02-22T14:46:27Z I! Loading config: /etc/telegraf/telegraf.conf
2024-02-22T14:46:27Z W! DeprecationWarning: Option "perdevice" of plugin "inputs.docker" deprecated since version 1.18.0 and will be removed in 2.0.0: use 'perdevice_include' instead
2024-02-22T14:46:27Z I! Starting Telegraf 1.29.5 brought to you by InfluxData the makers of InfluxDB
2024-02-22T14:46:27Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 60 outputs, 6 secret-stores
2024-02-22T14:46:27Z I! Loaded inputs: docker
2024-02-22T14:46:27Z I! Loaded aggregators:
2024-02-22T14:46:27Z I! Loaded processors:
2024-02-22T14:46:27Z I! Loaded secretstores:
2024-02-22T14:46:27Z I! Loaded outputs: file
2024-02-22T14:46:27Z I! Tags enabled: host=71b9c160f000
2024-02-22T14:46:27Z W! Deprecated inputs: 0 and 1 options
2024-02-22T14:46:27Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"71b9c160f000", Flush Interval:10s
2024-02-22T14:46:27Z D! [agent] Initializing plugins
2024-02-22T14:46:27Z D! [agent] Connecting outputs
2024-02-22T14:46:27Z D! [agent] Attempting connection to [outputs.file]
2024-02-22T14:46:27Z D! [agent] Successfully connected to outputs.file
2024-02-22T14:46:27Z D! [agent] Starting service inputs
2024-02-22T14:46:37Z D! [outputs.file] Wrote batch of 7 metrics in 86.761µs
2024-02-22T14:46:37Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
docker,engine_host=ryzen,host=71b9c160f000,server_version=25.0.2 n_images=1i,n_goroutines=58i,n_listener_events=0i,n_containers_running=1i,n_used_file_descriptors=33i,n_containers=4i,n_containers_stopped=3i,n_containers_paused=0i,n_cpus=32i 1708613190000000000
docker,engine_host=ryzen,host=71b9c160f000,server_version=25.0.2 memory_total=67333791744i 1708613190000000000
docker_container_status,container_image=telegraf,container_name=nice_grothendieck,container_status=running,container_version=latest,engine_host=ryzen,host=71b9c160f000,server_version=25.0.2 pid=19710i,exitcode=0i,restart_count=0i,container_id="71b9c160f000f32ebadf435a26e1b4363867ed4f20bf0d0e67d343ced8bcad4c",started_at=1708613187806809377i,uptime_ns=3209109177i,oomkilled=false 1708613191000000000
docker_container_mem,container_image=telegraf,container_name=nice_grothendieck,container_status=running,container_version=latest,engine_host=ryzen,host=71b9c160f000,server_version=25.0.2 usage_percent=0.3771173691887432,inactive_anon=0i,inactive_file=151552i,pgfault=6138i,pgmajfault=66i,unevictable=0i,max_usage=0i,usage=253927424i,container_id="71b9c160f000f32ebadf435a26e1b4363867ed4f20bf0d0e67d343ced8bcad4c",active_anon=41172992i,active_file=210833408i,limit=67333791744i 1708613191000000000
<snip>
Thanks for the response, now I see the difference in my configuration: instead of --user telegraf:$(stat -c '%g' /var/run/docker.sock) I am using --user root:$(stat -c '%g' /var/run/docker.sock).
The reason for this is that if I set the user to telegraf instead of root, the entrypoint does not set the required capabilities on the telegraf binary, and some other plugins stop working (#560 (comment)).
Example configuration:
[[inputs.docker]]
endpoint = "unix:///host/var/run/docker.sock"
[[inputs.ping]]
urls = ["github.com"]
method = "native"
[[outputs.file]]
files = ["stdout"]
Error for --user telegraf:$(stat -c '%g' /var/run/docker.sock):
[inputs.ping] ping failed: permission changes required, enable CAP_NET_RAW capabilities (refer to the ping plugin's README.md for more info)
Error for --user root:$(stat -c '%g' /var/run/docker.sock) (this used to work for me with previous container images):
[inputs.docker] Error in plugin: permission denied while trying to connect to the Docker daemon socket at unix:///host/var/run/docker.sock: Get "http://%2Fhost%2Fvar%2Frun%2Fdocker.sock/v1.24/info": dial unix /host/var/run/docker.sock: connect: permission denied
I checked the ping input's README and it seems this can be worked around by using method = "exec" instead of method = "native".
If this is expected and I should just not use ping's method = "native" together with the docker input, feel free to close the issue.
This scenario seems like a regression in behavior, so I want your thoughts on this. The scenario is when a user is trying to monitor docker via the socket + use ping.
To monitor docker, the user needs to pass an additional group to telegraf to have permission to use the socket. To use ping, we previously set capabilities on the telegraf binary in the entrypoint, but only if you are root.
When running as root, now that we are dropping all groups, including user-specified groups, the user can no longer do both at the same time.
Working with v1.29.4:
$ docker run --rm --user root:$(stat -c '%g' /var/run/docker.sock) -v /var/run/docker.sock:/var/run/docker.sock -v $PWD/config.toml:/etc/telegraf/telegraf.conf telegraf:1.29.4
Unable to find image 'telegraf:1.29.4' locally
1.29.4: Pulling from library/telegraf
7bb465c29149: Already exists
2b9b41aaa3c5: Already exists
c7c71dd3592a: Already exists
9140cc5510d6: Already exists
aab5bc94bab0: Pull complete
6396348f0ac2: Pull complete
Digest: sha256:d883b097fbbb1ed1db5fb1430a2d767ab72b423cf3cbb065bb274ff030d6311d
Status: Downloaded newer image for telegraf:1.29.4
2024-02-23T15:00:00Z I! Loading config: /etc/telegraf/telegraf.conf
2024-02-23T15:00:00Z W! DeprecationWarning: Option "perdevice" of plugin "inputs.docker" deprecated since version 1.18.0 and will be removed in 2.0.0: use 'perdevice_include' instead
2024-02-23T15:00:00Z I! Starting Telegraf 1.29.4 brought to you by InfluxData the makers of InfluxDB
2024-02-23T15:00:00Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 60 outputs, 6 secret-stores
2024-02-23T15:00:00Z I! Loaded inputs: docker ping
2024-02-23T15:00:00Z I! Loaded aggregators:
2024-02-23T15:00:00Z I! Loaded processors:
2024-02-23T15:00:00Z I! Loaded secretstores:
2024-02-23T15:00:00Z I! Loaded outputs: file
2024-02-23T15:00:00Z I! Tags enabled: host=6d2e075490ba
2024-02-23T15:00:00Z W! Deprecated inputs: 0 and 1 options
2024-02-23T15:00:00Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"6d2e075490ba", Flush Interval:10s
2024-02-23T15:00:00Z D! [agent] Initializing plugins
2024-02-23T15:00:00Z D! [agent] Connecting outputs
2024-02-23T15:00:00Z D! [agent] Attempting connection to [outputs.file]
2024-02-23T15:00:00Z D! [agent] Successfully connected to outputs.file
2024-02-23T15:00:00Z D! [agent] Starting service inputs
ping,host=6d2e075490ba,url=192.168.1.1 result_code=0i,packets_transmitted=1i,maximum_response_ms=0.841078,packets_received=1i,ttl=63i,percent_packet_loss=0,minimum_response_ms=0.841078,average_response_ms=0.841078,standard_deviation_ms=0 1708700410000000000
docker,engine_host=ryzen,host=6d2e075490ba,server_version=25.0.2 n_cpus=32i,n_containers_paused=0i,n_images=2i,n_goroutines=58i,n_listener_events=0i,n_used_file_descriptors=31i,n_containers=1i,n_containers_running=1i,n_containers_stopped=0i 1708700410000000000
docker,engine_host=ryzen,host=6d2e075490ba,server_version=25.0.2 memory_total=67333787648i 1708700410000000000
2024-02-23T15:00:10Z D! [outputs.file] Wrote batch of 3 metrics in 56.241µs
and now with latest:
$ docker run --rm --user root:$(stat -c '%g' /var/run/docker.sock) -v /var/run/docker.sock:/var/run/docker.sock -v $PWD/config.toml:/etc/telegraf/telegraf.conf telegraf:1.29.5
2024-02-23T15:00:47Z I! Loading config: /etc/telegraf/telegraf.conf
2024-02-23T15:00:47Z W! DeprecationWarning: Option "perdevice" of plugin "inputs.docker" deprecated since version 1.18.0 and will be removed in 2.0.0: use 'perdevice_include' instead
2024-02-23T15:00:47Z I! Starting Telegraf 1.29.5 brought to you by InfluxData the makers of InfluxDB
2024-02-23T15:00:47Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 60 outputs, 6 secret-stores
2024-02-23T15:00:47Z I! Loaded inputs: docker ping
2024-02-23T15:00:47Z I! Loaded aggregators:
2024-02-23T15:00:47Z I! Loaded processors:
2024-02-23T15:00:47Z I! Loaded secretstores:
2024-02-23T15:00:47Z I! Loaded outputs: file
2024-02-23T15:00:47Z I! Tags enabled: host=7e9459d14147
2024-02-23T15:00:47Z W! Deprecated inputs: 0 and 1 options
2024-02-23T15:00:47Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"7e9459d14147", Flush Interval:10s
2024-02-23T15:00:47Z D! [agent] Initializing plugins
2024-02-23T15:00:47Z D! [agent] Connecting outputs
2024-02-23T15:00:47Z D! [agent] Attempting connection to [outputs.file]
2024-02-23T15:00:47Z D! [agent] Successfully connected to outputs.file
2024-02-23T15:00:47Z D! [agent] Starting service inputs
2024-02-23T15:00:50Z E! [inputs.docker] Error in plugin: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info": dial unix /var/run/docker.sock: connect: permission denied
2024-02-23T15:00:50Z E! [inputs.docker] Error in plugin: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/json?filters=%7B%22status%22%3A%7B%22running%22%3Atrue%7D%7D": dial unix /var/run/docker.sock: connect: permission denied
2024-02-23T15:00:57Z D! [outputs.file] Wrote batch of 1 metrics in 42.41µs
2024-02-23T15:00:57Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
ping,host=7e9459d14147,url=192.168.1.1 maximum_response_ms=0.321623,result_code=0i,packets_transmitted=1i,packets_received=1i,percent_packet_loss=0,minimum_response_ms=0.321623,ttl=63i,average_response_ms=0.321623,standard_deviation_ms=0 1708700450000000000
While I agree this is a regression, it underscores how there was a problem before. I think the path forward is deciding on what groups to keep. Can we somehow detect what groups docker gave (eg, is this in an env var)? If not, perhaps we could honor an env var that the user can set? Eg (untested):
groups="telegraf"
if [ -n "$TELEGRAF_GROUPS" ]; then
    groups="$groups,$TELEGRAF_GROUPS"
fi
exec setpriv --reuid telegraf --regid telegraf --groups "$groups" "$@"
While I agree this is a regression, it underscores how there was a problem before.
Can you reiterate what you found here, since that was in a private issue? My understanding was that it was due to the fact that we retained the root group.
The issue was that the setpriv command was intending to drop privileges to the telegraf user, but it didn't drop group membership correctly (so, 'yes', root would've been retained). The root group grants a lot of privileges and in the case of this issue, it showed that it gave access to the docker socket, which for this plugin was a good thing, but for all others, it would not be. There is a lot more that group membership would give access to (not least of which, DAC checks within the kernel for various file and non-file access checks). Since there is clear intent to drop privileges when the container is started as root (an excellent thing to do!), we need to do it right and really drop, so we made this change.
That said, this issue shows there are cases that we need to handle when the user wants a specific behavior from the container related to group membership, so I put forth a couple of ideas on how to do that.
Since there is clear intent to drop privileges when the container is started as root (an excellent thing to do!), we need to do it right and really drop, so we made this change
When I made the root change the goal was around not running as root. Dropping everything, including groups passed in by a user was certainly not the intent. If a user provides a group via the user or group-add argument, I would expect that group to get passed on.
Can we somehow detect what groups docker gave (eg, is this in an env var)?
The only place I saw the groups show up was via id; there was nothing in /etc/group or the env.
If not, perhaps we could honor an env var that the user can set? Eg (untested):
We are now in a state where if you start up as the telegraf user, you can pass in groups and have them work as expected. However, if you start up as the root user, you cannot. Can we drop the root group, since we are already dropping the root user, but keep any other groups a user has asked to apply?
In effect, does this not end up doing the same thing as passing a list of groups via an environment variable, except without breaking our users or requiring them to make changes to their deployments?
Yes, assuming that the group membership in the container was very intentional, which AIUI can happen in one of two ways: 1. during container build (through Dockerfile USER or modifying /etc/group in the container) or 2. using docker run --group-add foo. We control '1' so don't have to worry about that. For '2', the only way to detect that is via id. I'm slightly uncomfortable with using id within the container, but I'm not sure there is a better choice. Perhaps this:
# honor groups supplied via 'docker run --group-add ...' but drop 'root' (the sed
# removes 'telegraf' since we unconditionally add it and don't want it listed twice)
groups="telegraf"
extra_groups="$(id -Gn | sed \
    -e 's/^\(root\|telegraf\)$//g' \
    -e 's/^\(root\|telegraf\) //g' \
    -e 's/ \(root\|telegraf\)$//g' \
    -e 's/ \(root\|telegraf\)//g' \
    -e 's/ /,/g')"
if [ -n "$extra_groups" ]; then
    groups="$groups,$extra_groups"
fi
exec setpriv --reuid telegraf --regid telegraf --groups "$groups" "$@"
That sed is ugly since it needs to backslash (, ) and |, and it also tries to handle the cases where the group is the only entry, the first, the last, or in the middle. The extra_groups check handles the case where piping id into sed comes up empty.
That sed isn't quite right. I'll give a better one.
This one should work with the understanding that id -Gn will not have duplicates in the output:
# honor groups supplied via 'docker run --group-add ...' but drop 'root' (the sed
# removes 'telegraf' since we unconditionally add it and don't want it listed twice)
groups="telegraf"
extra_groups="$(id -Gn | sed \
    -e 's/ /,/g' \
    -e 's/,\(root\|telegraf\),/,/g' \
    -e 's/^\(root\|telegraf\),//g' \
    -e 's/,\(root\|telegraf\)$//g' \
    -e 's/^\(root\|telegraf\)$//g')"
if [ -n "$extra_groups" ]; then
    groups="$groups,$extra_groups"
fi
exec setpriv --reuid telegraf --regid telegraf --groups "$groups" "$@"
It handles the cases where the group is the only entry, the first, the last, or in the middle. It preserves groups like 'groot', 'rootbeer' and 'frooty'. The extra_groups check handles the case where piping id into sed comes up empty.
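As an alternative to the regular expressions, the same filtering can be written as a loop over the group names, which matches whole names only and does not depend on where a name sits in the list (including root and telegraf sitting next to each other mid-list). This is only a sketch: filter_groups is a hypothetical helper name, and it assumes group names contain no whitespace.

```shell
# filter_groups NAME...: print the given group names as a comma-separated
# list with 'root' and 'telegraf' removed. (Hypothetical helper.)
filter_groups() {
    out=""
    for g in "$@"; do
        case "$g" in
            root|telegraf) ;;                 # dropped here; telegraf is added separately
            *) out="${out:+$out,}$g" ;;       # append with a comma only when non-empty
        esac
    done
    printf '%s\n' "$out"
}

# Intended use in an entrypoint (assumption: relies on word splitting of id -Gn):
#   groups="telegraf"
#   extra_groups="$(filter_groups $(id -Gn))"
#   if [ -n "$extra_groups" ]; then
#       groups="$groups,$extra_groups"
#   fi
```

Because the case pattern compares complete words, names such as 'groot' or 'rootbeer' pass through untouched without any anchoring logic.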
Thank you for putting that together! I'll give it a look over this week so we can get this change in for v1.30.
Thanks again @jdstrand for the work on this.
I've put up #727 which makes the change to the nightly image. I've been playing with it a bit and I think it looks good, but wanted to get some additional feedback before making the changes to the other images.
Would you be opposed to landing that first and then we can land a change to the other images next week?
That's fine by me
re-opening until we land this in other releases.
These changes will go out with the release of v1.30 on or around next Monday. The changes will apply to older images as well.
This issue is closed so I'm not sure anybody will even see this, but should this:
extra_groups="$(id -Gn || true)"
be changed to:
extra_groups="$(id -Gn telegraf || true)"
because without that it doesn't work right in a docker compose / buildx script. At the point that it runs id it is still root. I think what it really means to do there is to add the extra groups of the telegraf user.
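For anyone weighing this suggestion, the two forms answer different questions: id with no user argument reports the groups held by the calling process, while id with a user name looks that user up in the container's user and group database. Groups supplied with docker run --group-add are attached to the process and, as noted earlier in the thread, do not appear in /etc/group. A small sketch to compare the two inside a container (function names are hypothetical; numeric -G is used to avoid 'cannot find name' warnings for unnamed gids):

```shell
# Groups held by the current process, as numeric gids.
process_groups() {
    id -G
}

# Groups recorded for a named user in the user/group database.
# Prints nothing if the user does not exist.
user_db_groups() {
    id -G "$1" 2>/dev/null || true
}

# Intended comparison inside the container:
#   echo "process:  $(process_groups)"
#   echo "telegraf: $(user_db_groups telegraf)"
```

If a gid passed via --group-add shows up only in the first line, switching the entrypoint to the named-user form would stop honoring it.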