kata-containers/runtime

Sandbox creation is slower when use_vsock is enabled

BetaXOi opened this issue · 22 comments

Description of problem

Sandbox creation is slower when using vsock than when using virtio-serial.

Set 'enable_tracing = true' and create 10 kata containers with and without 'use_vsock=true', recording the time of 'docker run' and the opentracing spans:
for i in $(seq 10); do time docker run --rm --runtime kata-runtime-test -tid busybox sh; sleep 1; done

Expected result

The creation times should be similar.

Actual result

Creating a sandbox with vsock is almost 2 seconds slower than with virtio-serial.
[timing screenshots: with and without vsock]

The reason is that connect(2) is invoked before vsock is ready, and the connect timeout in the vsock kernel module is 2 seconds:
#define VSOCK_DEFAULT_CONNECT_TIMEOUT (2 * HZ)

[sequence diagram: current connect sequence]

/cc @amshinde ! Good find @BetaXOi

Hi guys

Is x86_64 still suffering from this issue?
We've encountered the same problem on aarch64. The whole boot-up time with vsock enabled is affected by VSOCK_DEFAULT_CONNECT_TIMEOUT.

After reading the discussion under #1918 and https://lists.nongnu.org/archive/html/qemu-devel/2019-12/msg02225.html, we're trying to reverse the vsock connection to avoid the timeout.
Referring to @stefanha's suggestion here:

This can be done efficiently as follows:
1. kata-runtime listens on a vsock port
2. kata-agent-port=PORT is added to the kernel command-line options
3. kata-agent parses the port number and connects to the host
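
A minimal sketch of what step 1 could look like using golang.org/x/sys/unix directly (agentPort is a hypothetical config value; errors would be returned to the caller):

        // kata-runtime side: listen on a vsock port and wait for kata-agent to dial in.
        fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM, 0)
        if err != nil {
                return err
        }
        // VMADDR_CID_ANY binds the port for connections from any guest CID.
        if err := unix.Bind(fd, &unix.SockaddrVM{CID: unix.VMADDR_CID_ANY, Port: agentPort}); err != nil {
                return err
        }
        if err := unix.Listen(fd, 1); err != nil {
                return err
        }
        connFd, _, err := unix.Accept(fd)
        if err != nil {
                return err
        }
        // connFd now carries the agent connection; see the peer-CID check discussed below.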

He also pointed out other ways to avoid this 2s timeout:

Userspace APIs to avoid the 2 second wait already exist:

1. The SO_VM_SOCKETS_CONNECT_TIMEOUT socket option controls the connect
   timeout for this socket.

2. Non-blocking connect allows the userspace process to do other things
   while a connection attempt is being made.

So, what's the progress here? ;)
cc @bergwolf @lifupan @gnawux @amshinde @devimc @grahamwhaley @jodh-intel @egernst @justin-he @jongwu

I still think the direction of the connection should be reversed and the agent should be launched with a kata-agent-port=PORT parameter.

Unfortunately I am currently busy with other work so I will not be able to implement this change.

@stefanha okay, I'll try to follow your thinking to do the implementation. ;)

@justin-he sent a patch for Linux vhost-vsock to fix this issue and it is now merged in Linus' tree.
It will be released with Linux 5.7-rc5.

The patch should fix this issue, but I agree with @stefanha about the direction of the connection.

The issue with the kernel patch is that it leaves kata-runtime with a loop that consumes 100% CPU when attempting to connect before kata-agent has created the socket and put it into the listen state.

Besides, kata should not be 100% sure that the host kernel contains https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b841030625cde5f784dd62aec72d6a766faae70. Hence the proposal from @stefanha (reverse connection) still makes sense IMO.

Hi @stefanha
I got a little confused here: even in my method of reversing the connection, I just use it for a quick handshake to avoid the old kernel's 2s blocking issue.
I let the runtime launch a goroutine to accept one connection from the agent to do a quick handshake, but at the same time the main routine still needs to wait/block until the connection arrives before we can proceed.
cc @lifupan

@stefanha thanks for the explanation.
I misunderstood what you said before: you want to reverse the connection entirely, making the runtime the server rather than the agent.
My PR is just a quick handshake started by kata-agent to work around the 2s timeout, but the agent is still the server.
I'll close this and open a new one. ;)

Leaving aside the incompatibility of reversing the server/client roles, it would be another attack surface for the runtime to listen for guest connections.

What about setting a short connect timeout and busy-loop connecting on the host (runtime) side? That would be much easier to implement and doesn't break existing compatibility.

Hi @stefanha @bergwolf
Here is one thing about reversing the whole connection that I couldn't figure out:
Requests always come from kata-runtime to the agent at unpredictable times, e.g. for hotplug.
So I couldn't figure out a way for the agent to avoid setting up a server to handle all sorts of requests.

That's also part of the reason why I only reversed the connection for the first handshake in my earlier commits.

The whole kernel boot-up with systemd may cost around 300ms. To avoid a 300ms busy loop in total, how about using non-blocking mode and polling for a while?
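
A sketch of that non-blocking variant, again assuming golang.org/x/sys/unix (cid, port, the 10ms poll interval, and errConnectPending are illustrative):

        // Non-blocking connect returns EINPROGRESS immediately instead of blocking for 2s.
        fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM|unix.SOCK_NONBLOCK, 0)
        if err != nil {
                return err
        }
        if err := unix.Connect(fd, &unix.SockaddrVM{CID: cid, Port: port}); err != nil && err != unix.EINPROGRESS {
                return err
        }
        // Poll for writability in short slices so the caller can retry or give up early.
        pfds := []unix.PollFd{{Fd: int32(fd), Events: unix.POLLOUT}}
        n, err := unix.Poll(pfds, 10 /* ms */)
        if err != nil || n == 0 {
                return errConnectPending // not ready yet, caller retries
        }
        // SO_ERROR reports whether the in-flight connect actually succeeded.
        if soerr, err := unix.GetsockoptInt(fd, unix.SOL_SOCKET, unix.SO_ERROR); err != nil || soerr != 0 {
                return fmt.Errorf("vsock connect failed: errno %d", soerr)
        }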

The short lifetime of the runtime means it is not well suited to acting as a server.

I have implemented sample code for the reverse connection before, but the code is really smelly and dirty, so we did not use it in our project.
BetaXOi@187ccb9
BetaXOi/agent@a111362

Another solution is to busy-wait in the runtime and implement connect timeout control, which is the solution we are currently using.
BetaXOi@6cd6780

@stefanha

Can you be more specific about how this is a new attack surface? The vsock
code in the host kernel is already being exercised by every sandbox. The
only difference is that the accept(2) code path is currently not being
exercised - but in exchange we stop exercising the connect(2) code path
when reversing the connection. Is there a significant difference here?

It is indeed the listen part, but it's not just the kernel code itself: the runtime would be listening for a guest connection, which exposes the runtime to attack from the guest. IIRC when we started the project, it was deliberately decided that the agent be the listener, to avoid the runtime being attacked from inside the guest.

When the socket is accepted, kata-runtime should use getpeername(2) to
verify that the remote CID matches the expected sandbox VM. This prevents
the attack you mentioned.
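
Sketched with golang.org/x/sys/unix, that check could look like this (connFd and expectedCID are placeholders):

        // After accept(2), verify the peer really is our sandbox VM.
        sa, err := unix.Getpeername(connFd)
        if err != nil {
                return err
        }
        if vm, ok := sa.(*unix.SockaddrVM); !ok || vm.CID != expectedCID {
                unix.Close(connFd)
                return fmt.Errorf("rejecting vsock connection from unexpected peer CID")
        }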

It is a possible mitigation, but it does not change the fact that the attack surface is increased. While we can solve the problem w/o introducing a new attack surface, we should consider carefully whether the new attack surface is really necessary.

And another reason we shouldn't go in the reversing direction is that firecracker and cloud hypervisor don't support it. We would end up with two diverging communication models.

if an attacker takes over an unprivileged host process they can run a busy loop that attempts to
connect(2) to vsock ports and can win the race with kata-runtime's connect(2).

That is a different threat model. The threat model Kata is following does not really handle untrusted host. It is the container apps we often do not trust.

The issue with the kernel patch is that it leaves kata-runtime with a loop that consumes 100% CPU when attempting to connect before kata-agent has created the socket and put it into the listen state.

I think the main problem is that vhost implements its own workqueue but does not support delayed queuing. Is there any plan to fix this on the kernel side? While the kernel fix may not be available right away, we can implement busy reconnecting with a delay in kata-runtime. E.g., a 10ms delay is quite acceptable from both the CPU-consumption and speed points of view.

@bergwolf I don't think we can convince each other but adding a configurable delay (e.g. 10ms but it can be set to 0ms if you want to busy wait) to the connect(2) loop is a workable compromise.
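
A sketch of that compromise (connectOnce stands in for a single short-timeout dial attempt such as the snippet later in this thread; retryDelay comes from configuration and may be 0 for pure busy waiting):

        // Retry the short-timeout connect until it succeeds or the deadline passes.
        func connectWithRetry(cid, port uint32, retryDelay time.Duration, deadline time.Time) (int, error) {
                for {
                        fd, err := connectOnce(cid, port) // hypothetical short-timeout dial
                        if err == nil {
                                return fd, nil
                        }
                        if time.Now().After(deadline) {
                                return -1, fmt.Errorf("vsock connect timed out: %w", err)
                        }
                        time.Sleep(retryDelay) // retryDelay == 0 degenerates to a busy loop
                }
        }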

I think @justin-he's kernel patch already does what you are looking for? It is available here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b841030625cde5f784dd62aec72d6a766faae70

@stefanha yeah, a configurable delay sounds good to me. Thanks!

Hi~~ Glad we all agree to use a configurable delay. ;)

As for host kernels that don't have justin's newly merged kernel patch, we probably need to shorten the timeout and, of course, add a configurable delay.
For example, we could shorten the connect timeout to 10ms and, on failure, retry after another 10ms delay.
But here is the thing: mdlayher's vsock library doesn't provide an API to dial with a timeout; it only provides Dial with the default 2s.
We need to add something like this to provide dialing with a configurable timeout:

        fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM, 0)
        if err != nil {
                return err
        }
        vsa := &unix.SockaddrVM{
                CID:  cid,
                Port: port,
        }
        // Override the kernel's 2s default connect timeout for this socket.
        tv := unix.NsecToTimeval(1e6 * vsockConnectTimeoutMs)
        if err := unix.SetsockoptTimeval(fd, unix.SOL_SOCKET, unix.SO_VM_SOCKETS_CONNECT_TIMEOUT, &tv); err != nil {
                return err
        }
        if err := unix.Connect(fd, vsa); err != nil {
                return err
        }

And AFAICT, we can't avoid using mdlayher's vsock library; it provides the net.Conn implementation for the vsock protocol.

Hi @BetaXOi Seems you have it implemented in your private commits BetaXOi@6cd6780; maybe you'd like to also enable it in https://github.com/mdlayher/vsock/ ? 😊

Just an update: the 0b841030625c ("vhost: vsock: kick send_pkt worker once device is started") Linux patch is queued up for the stable branches (4.9, 4.14, 4.19, 5.4, 5.6), so I hope the next stable releases will contain the fix.