swap deployment fails when securityContext contains unprivileged user
kunickiaj opened this issue · 20 comments
What were you trying to do?
trying to use the swap-deployment feature with one of my deployments.
What did you expect to happen?
expected to expose two ports for a local process and have traffic directed to them
What happened instead?
telepresence died with the attached traceback
full log in gist: https://gist.github.com/kunickiaj/080328802f437cdc1fbb6722856de4ee
It seems that the root cause is the securityContext in the container I wished to swap.
Other (more privileged) containers do not have this issue. Was able to confirm that removing the following securityContext from the affected container allowed me to work around the issue:
securityContext:
  runAsNonRoot: true
  runAsUser: 500
Probably related to #617 #737 and #723
A possible fix might be to have telepresence replace the relevant parts of the security context if it does in fact need root (e.g. removing the runAsNonRoot). Would also suggest alerting the user to those kinds of modifications.
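The fix proposed above could look roughly like the following. This is a minimal sketch, not Telepresence's actual code: the function name `relax_security_context`, the constant `ROOT_BLOCKING_FIELDS`, and the choice of which fields to strip are all assumptions made for illustration. It treats the container as a plain dict shaped like a Deployment manifest entry.

```python
import copy

# Assumption for this sketch: these are the fields that stop a root-requiring
# proxy image from starting. The real list may differ.
ROOT_BLOCKING_FIELDS = ("runAsNonRoot", "runAsUser")


def relax_security_context(container):
    """Return (new_container, removed) without mutating the input.

    `container` is a plain dict shaped like one entry of
    spec.template.spec.containers in a Deployment manifest.
    """
    new_container = copy.deepcopy(container)
    context = new_container.get("securityContext", {})
    removed = {}
    for field in ROOT_BLOCKING_FIELDS:
        if field in context:
            removed[field] = context.pop(field)
    # Drop an empty securityContext so the manifest stays tidy.
    if "securityContext" in new_container and not context:
        del new_container["securityContext"]
    return new_container, removed


original = {
    "name": "pipelinestore",
    "securityContext": {"runAsNonRoot": True, "runAsUser": 500},
}
swapped, removed = relax_security_context(original)
# Alert the user to every modification made to the swapped copy.
for field, value in removed.items():
    print(f"warning: removed securityContext.{field}={value} from swapped copy")
```

Returning the removed fields separately keeps the warning step explicit, so the user can see exactly what differs between the original deployment and the swapped copy.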
Automatically included information
Command line: ['/usr/local/bin/telepresence', '--swap-deployment', 'sch-control-hub-pipelinestore:pipelinestore', '--expose', '18631', '--expose', '18632']
Version: 0.96
Python version: 3.6.6 (default, Oct 4 2018, 20:50:27) [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.2)]
kubectl version: Client Version: v1.13.0 // Server Version: v1.10.0
oc version: oc v3.11.0+0cbc58b // kubernetes v1.11.0+d4cacc0 // features: Basic-Auth // // Server https://192.168.37.162:8443 // kubernetes v1.10.0
OS: Darwin streamsam381331.nerdworld.xyz 18.2.0 Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64 x86_64
Traceback (most recent call last):
File "/usr/local/bin/telepresence/telepresence/cli.py", line 131, in crash_reporting
yield
File "/usr/local/bin/telepresence/telepresence/main.py", line 70, in main
socks_port, ssh = do_connect(runner, remote_info)
File "/usr/local/bin/telepresence/telepresence/connect/connect.py", line 99, in do_connect
return connect(runner_, remote_info, is_container_mode, args.expose)
File "/usr/local/bin/telepresence/telepresence/connect/connect.py", line 57, in connect
ssh.wait()
File "/usr/local/bin/telepresence/telepresence/connect/ssh.py", line 82, in wait
raise RuntimeError("SSH isn't starting.")
RuntimeError: SSH isn't starting.
Logs:
20 | Handling connection for 52930
48.2 60 | Connection to 127.0.0.1 closed by remote host.
48.2 TEL | [60] exit 255 in 0.56 secs.
48.4 TEL | [61] Running: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 52930 telepresence@127.0.0.1 /bin/true
48.5 20 | Handling connection for 52930
49.0 61 | Connection to 127.0.0.1 closed by remote host.
49.0 TEL | [61] exit 255 in 0.57 secs.
49.3 TEL | [62] Running: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 52930 telepresence@127.0.0.1 /bin/true
49.3 20 | Handling connection for 52930
49.8 62 | Connection to 127.0.0.1 closed by remote host.
49.8 TEL | [62] exit 255 in 0.57 secs.
51.1 19 | 2018-12-15T00:16:01+0000 [Poll#error] Failed to contact Telepresence client:
51.1 19 | 2018-12-15T00:16:01+0000 [Poll#error] An error occurred while connecting: 99: Address not available.
51.1 19 | 2018-12-15T00:16:01+0000 [Poll#warn] Perhaps it's time to exit?
Thank you for the issue. Yup, this is #723. And thank you for the suggestions.
In fact, we can do better. If the original container didn't need root (to bind to low ports), then Telepresence doesn't need it either. The unprivileged Telepresence image is hard-coded to run as UID 1000. When the user wants to swap, Tel should notice that the original deployment has runAsUser and modify the swapped copy to request UID 1000. And yes, we should notify the user.
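As a rough illustration of that idea, the sketch below rewrites runAsUser on a copy of the container spec and produces a message for the user. The names `request_telepresence_uid` and `TELEPRESENCE_UID` are hypothetical; only the value 1000 comes from the comment above.

```python
import copy

# UID the unprivileged image is said to be hard-coded to use (from the thread).
TELEPRESENCE_UID = 1000


def request_telepresence_uid(container):
    """Return (new_container, message); message is None if nothing changed.

    `container` is a plain dict shaped like a Deployment container entry.
    """
    new_container = copy.deepcopy(container)
    context = new_container.get("securityContext")
    if not context or "runAsUser" not in context:
        return new_container, None
    old_uid = context["runAsUser"]
    if old_uid == TELEPRESENCE_UID:
        return new_container, None
    # Only runAsUser is touched; runAsNonRoot and other fields are preserved.
    context["runAsUser"] = TELEPRESENCE_UID
    message = f"swapped copy: runAsUser changed from {old_uid} to {TELEPRESENCE_UID}"
    return new_container, message


container = {
    "name": "pipelinestore",
    "securityContext": {"runAsNonRoot": True, "runAsUser": 500},
}
swapped, message = request_telepresence_uid(container)
if message:
    print(message)
```

Because runAsNonRoot is left in place, the swapped pod would still be rejected if it tried to run as root, which matches the point that Telepresence does not need root when the original container did not.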
I'm not sure having the swapped copy request UID 1000 is the solution; that overlaps with a lot of things, including the default initial user for CentOS (which is the "admin" user in a number of deployments, so we can't use it in live deployments on my side of things). Perhaps it would be better to modify so that the instance can run as other UIDs? This is not an unusual use case, at least until Kubernetes gets some sort of UID/GID namespacing capability.
I'm also not sure quite what the problem is here; when I leave the deployment up and running, I can still connect via SSH to the server, though I haven't extensively probed to see what can be executed. In general, it's probably best to try to make sure that the provisioned executables can be executed under any UID/GID, since you don't have much way of controlling how people deploy them (and just assuming they can use the hardcoded one is, uh, a bit of an assumption).
OK, I tracked it down: even if I relax the permissions in telepresence-k8s on the SSH host secrets and local directory a bit (which makes me uneasy anyway), the problem is that by default, the SSH daemon is running as the user we're expecting to log in. If the securityContext has a different runAsUser applied, the SSH connection is still trying to log in as telepresence, which has a fixed UID of 1000. The SSH server can't switch UID to 1000 from the other one, so it barfs.
Short of k8s supporting Docker's UID namespacing (which, IIRC, is still experimental and not likely to land in k8s anytime soon), this ends up being a core problem; I don't think there's any way to just run an SSH service that doesn't try to switch to a particular user, which would be the obvious solution here if it existed. The other way would be to change the UID of telepresence to that of the current user on first startup, which is messy at best and definitely more than a little risky from a security standpoint.
Thoughts?
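The failure mode described in this comment can be modelled in a few lines. This is a deliberately simplified sketch, not how sshd decides anything: it assumes only root (UID 0) may change to a different UID and ignores Linux capabilities such as CAP_SETUID. The function name `can_become` is hypothetical.

```python
import os

# Fixed UID of the telepresence user, per the thread.
TELEPRESENCE_UID = 1000


def can_become(target_uid, current_uid=None):
    """Simplified model: a process may run as `target_uid` only if it is
    already that UID or is root. Capabilities are ignored in this sketch.
    """
    if current_uid is None:
        current_uid = os.geteuid()
    return current_uid == 0 or current_uid == target_uid


# Compare a root daemon, a pod forced to runAsUser 500, and one already at 1000.
for uid in (0, 500, 1000):
    print(uid, can_become(TELEPRESENCE_UID, current_uid=uid))
```

Under this model a daemon forced to run as UID 500 cannot serve a login for UID 1000, which is the situation a runAsUser override creates.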
@david-l-riley UsePrivilegeSeparation no and libnsswrapper can solve the ssh daemon uid problems. Check this out: https://github.com/blacksaltIT/docker_ssh
I hope it helps
@janosroden That approach requires modifying the root filesystem of the container on startup, before launching sshd. This is in fact what Telepresence used to do, but that caused all sorts of problems due to other restrictive Kubernetes setups. What we really need is an ssh server that doesn't rely on /etc/passwd and friends. We could build something using Twisted Conch or some Go stuff or whatever else. We'd love a PR addressing that.
Does it really need to be SSH, strictly speaking? That is, would other tunneling solutions be acceptable? Or is it preferred to stick with SSH for Reasons?
Telepresence uses SSH because it covers volumes (via sshfs), networking (via sshuttle), and port forwarding. Other solutions (combinations of tools, perhaps) could work too.
Just checking. Removing it would remove some complexity, but it sounds like at the cost of adding significant other complexity. I'll see about opening a PR for either the Go approach (I do like the look of it, but it'll need to be built for all supported architectures) or the Twisted Conch approach (since we already use it).
Looks like I probably also need to look into how sshfs and sshuttle work to determine which solution is going to be optimal for those...
Ok, if we can't set up ssh without allowing use of a specific uid, can we add a configuration option to select the telepresence container's serviceaccount? I'd like not to touch the default sa, but instead tell telepresence to use its own sa, with scc set up.
@dbazhal That's a good idea! Can you please create an issue requesting that as a new feature? We can work out how to make it happen there. Thank you.
@david-l-riley @dbazhal @kunickiaj - you may want to try this new image which should solve this issue.
If you give it a try, please share your feedback in the PR #1114 or here.
Image out of the new Dockerfile.no_runasany_perms can be used from here:
docker.io/researchiteng/telepresence:0.101
The suggestion of @janosroden is not enough. It used to be required for older versions of sshd, but it is no longer required and does not seem to help with this issue.
We have runAsUser, runAsGroup and runAsNonRoot defined on our containers. As a result we run into this issue with the exact same error as in #1398.
I think we can work around it by dropping runAsUser and runAsGroup and modifying all our docker image builds to use a UID instead of a username such that Kubernetes can verify that the user is a non-root user. At least when testing this it works fine.
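To show why a numeric UID in the image matters for this workaround, here is a rough model of the check. It is an approximation for illustration only, not Kubernetes code: the function name `passes_run_as_non_root` and the reason strings are made up. The idea is that a username cannot be proven non-root without the container's /etc/passwd, whereas a non-zero numeric UID can.

```python
def passes_run_as_non_root(image_user):
    """Roughly mimic a runAsNonRoot check on an image's USER value.

    Returns (ok, reason). `image_user` is the string from the Dockerfile
    USER instruction, e.g. "1000", "1000:1000" or "appuser".
    """
    # Ignore an optional ":group" suffix; only the user part matters here.
    user = image_user.split(":", 1)[0]
    if not user:
        return False, "no USER set, image defaults to root"
    if not user.isdecimal():
        return False, f"non-numeric user ({user}), cannot verify it is non-root"
    if int(user) == 0:
        return False, "user is root (UID 0)"
    return True, "numeric non-root UID"


for value in ("appuser", "10111:10111", "0", ""):
    print(repr(value), passes_run_as_non_root(value))
```

With runAsUser dropped from the pod spec, only the image's own USER is left to check, which is why the image builds need to switch from a username to a UID.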
It is still pretty inconvenient so I would also be interested in a fix for telepresence itself.
I have an issue too when I have the security context below:
securityContext:
  runAsUser: 10111
  runAsGroup: 10111
  runAsNonRoot: true
  fsGroup: 10111
It fails because I can't mount my host keys.
Same issue here
I managed to get it working by changing runAsGroup to "0". (runAsUser was already 1000). However, I'm wondering whether setting runAsGroup to root isn't a security risk? Is there a specific reason for Telepresence setting it to "0", while setting the user to "1000" ?
I think this still may be an issue in Telepresence 2, so we should do some investigation before we close this ticket.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment, or this will be closed in 7 days.
This issue was closed because it has been stalled for 7 days with no activity.