kata-containers/runtime

Sandbox creation is slower when use_vsock is enabled

BetaXOi opened this issue · 22 comments

Description of problem

Sandbox creation is slower when using vsock than when using virtio-serial.

Set 'enable_tracing = true' and create 10 kata containers with and without 'use_vsock=true', recording the time of 'docker run' and the opentracing spans:
for i in $(seq 10); do time docker run --rm --runtime kata-runtime-test -tid busybox sh; sleep 1; done

Expected result

The creation times should be similar.

Actual result

Creating a sandbox with vsock is almost 2 seconds slower than with virtio-serial.
[timing screenshots: with and without vsock]

The reason is that connect(2) is invoked before vsock is ready, and the connect timeout in the vsock kernel module is 2 seconds:
#define VSOCK_DEFAULT_CONNECT_TIMEOUT (2 * HZ)

[sequence diagram: current connect sequence]

/cc @amshinde ! Good find @BetaXOi

Hi guys

Is x86_64 still suffering from this issue?
We've encountered the same problem on aarch64. The whole boot-up time with vsock enabled is affected by VSOCK_DEFAULT_CONNECT_TIMEOUT.

After reading the discussion under #1918 and https://lists.nongnu.org/archive/html/qemu-devel/2019-12/msg02225.html, we're trying to reverse the vsock connection to avoid the timeout.
Referring to @stefanha's suggestion here:

This can be done efficiently as follows:
1. kata-runtime listens on a vsock port
2. kata-agent-port=PORT is added to the kernel command-line options
3. kata-agent parses the port number and connects to the host
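
A minimal sketch of what step 1 could look like using golang.org/x/sys/unix directly (agentPort is a hypothetical config value; errors would be returned to the caller):

        // kata-runtime side: listen on a vsock port and wait for kata-agent to dial in.
        fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM, 0)
        if err != nil {
                return err
        }
        // VMADDR_CID_ANY binds the port for connections from any guest CID.
        if err := unix.Bind(fd, &unix.SockaddrVM{CID: unix.VMADDR_CID_ANY, Port: agentPort}); err != nil {
                return err
        }
        if err := unix.Listen(fd, 1); err != nil {
                return err
        }
        connFd, _, err := unix.Accept(fd)
        if err != nil {
                return err
        }
        // connFd now carries the agent connection; see the peer-CID check discussed below.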

He also pointed out other ways to avoid this 2s timeout:

Userspace APIs to avoid the 2 second wait already exist:

1. The SO_VM_SOCKETS_CONNECT_TIMEOUT socket option controls the connect
   timeout for this socket.

2. Non-blocking connect allows the userspace process to do other things
   while a connection attempt is being made.

So, what's the progress here? ;)
cc @bergwolf @lifupan @gnawux @amshinde @devimc @grahamwhaley @jodh-intel @egernst @justin-he @jongwu

I still think the direction of the connection should be reversed and the agent should be launched with a kata-agent-port=PORT parameter.

Unfortunately I am currently busy with other work so I will not be able to implement this change.

@stefanha okay, I'll try to follow your thinking to do the implementation. ;)

@justin-he sent a patch for Linux vhost-vsock to fix this issue and it is now merged in Linus' tree.
It will be released with Linux 5.7-rc5.

The patch should fix this issue, but I agree with @stefanha about the direction of the connection.

The issue with the kernel patch is that it leaves kata-runtime with a loop that consumes 100% CPU when attempting to connect before kata-agent has created the socket and put it into the listen state.

Besides, kata should not be 100% sure that the host kernel contains https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b841030625cde5f784dd62aec72d6a766faae70. Hence the proposal from @stefanha (reverse connection) still makes sense IMO.

Hi @stefanha
I got a little confused here: even in my method of reversing the connection, I just use it for a quick handshake to avoid the old kernel's 2s blocking issue.
I let the runtime launch a goroutine to accept one connection from the agent to do a quick handshake, but at the same time the main routine still needs to wait/block until the connection arrives before we can proceed.
cc @lifupan

@stefanha thanks for the explanation.
I misunderstood what you said before: you want to reverse the connection entirely, making the runtime the server rather than the agent.
My PR is just a quick handshake started by kata-agent to work around the 2s timeout, but the agent is still the server.
I'll close this and open a new one. ;)

Leaving aside the incompatibility of reversing the server/client roles, it would be another attack surface for the runtime to listen for guest connections.

What about setting a short connect timeout and busy-loop connecting on the host (runtime) side? That would be much easier to implement and doesn't break existing compatibility.

Hi @stefanha @bergwolf
Here is one thing about reversing the whole connection that I couldn't figure out:
Requests always come from kata-runtime to the agent at unpredictable times, e.g. for hotplug.
So I couldn't figure out a way for the agent to avoid setting up a server to handle all sorts of requests.

That's also part of the reason why I only reversed the connection for the first handshake in my earlier commits.

The whole kernel boot-up with systemd may cost around 300ms. To avoid a 300ms busy loop in total, how about using non-blocking mode and polling for a while?
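
A sketch of that non-blocking variant, again assuming golang.org/x/sys/unix (cid, port, the 10ms poll interval, and errConnectPending are illustrative):

        // Non-blocking connect returns EINPROGRESS immediately instead of blocking for 2s.
        fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM|unix.SOCK_NONBLOCK, 0)
        if err != nil {
                return err
        }
        if err := unix.Connect(fd, &unix.SockaddrVM{CID: cid, Port: port}); err != nil && err != unix.EINPROGRESS {
                return err
        }
        // Poll for writability in short slices so the caller can retry or give up early.
        pfds := []unix.PollFd{{Fd: int32(fd), Events: unix.POLLOUT}}
        n, err := unix.Poll(pfds, 10 /* ms */)
        if err != nil || n == 0 {
                return errConnectPending // not ready yet, caller retries
        }
        // SO_ERROR reports whether the in-flight connect actually succeeded.
        if soerr, err := unix.GetsockoptInt(fd, unix.SOL_SOCKET, unix.SO_ERROR); err != nil || soerr != 0 {
                return fmt.Errorf("vsock connect failed: errno %d", soerr)
        }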

The short lifetime of the runtime means it is not well suited to acting as a server.

I have implemented sample code for the reverse connection before, but the code is really smelly and dirty, so we did not use it in our project.
BetaXOi@187ccb9
BetaXOi/agent@a111362

Another solution is to busy-wait in the runtime and implement connect timeout control, which is the solution we are currently using.
BetaXOi@6cd6780

@stefanha

Can you be more specific about how this is a new attack surface? The vsock
code in the host kernel is already being exercised by every sandbox. The
only difference is that the accept(2) code path is currently not being
exercised - but in exchange we stop exercising the connect(2) code path
when reversing the connection. Is there a significant difference here?

It is indeed the listen part, but it's not just the kernel code itself: the runtime would be listening for a guest connection, which exposes the runtime to attack from the guest. IIRC when we started the project, it was deliberately decided that the agent be the listener, to avoid the runtime being attacked from inside the guest.

When the socket is accepted, kata-runtime should use getpeername(2) to
verify that the remote CID matches the expected sandbox VM. This prevents
the attack you mentioned.
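
Sketched with golang.org/x/sys/unix, that check could look like this (connFd and expectedCID are placeholders):

        // After accept(2), verify the peer really is our sandbox VM.
        sa, err := unix.Getpeername(connFd)
        if err != nil {
                return err
        }
        if vm, ok := sa.(*unix.SockaddrVM); !ok || vm.CID != expectedCID {
                unix.Close(connFd)
                return fmt.Errorf("rejecting vsock connection from unexpected peer CID")
        }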

It is a possible mitigation, but it does not change the fact that the attack surface is increased. While we can solve the problem w/o introducing a new attack surface, we should consider carefully whether the new attack surface is really necessary.

And another reason we shouldn't go in the reversing direction is that firecracker and cloud hypervisor don't support it. We would end up with two diverging communication models.

if an attacker takes over an unprivileged host process they can run a busy loop that attempts to
connect(2) to vsock ports and can win the race with kata-runtime's connect(2).

That is a different threat model. The threat model Kata is following does not really handle untrusted host. It is the container apps we often do not trust.

The issue with the kernel patch is that it leaves kata-runtime with a loop that consumes 100% CPU when attempting to connect before kata-agent has created the socket and put it into the listen state.

I think the main problem is that vhost implements its own workqueue but does not support delayed queuing. Is there any plan to fix this on the kernel side? While the kernel fix may not be available right away, we can implement busy reconnecting with a delay in kata-runtime. E.g., a 10ms delay is quite acceptable from both the CPU-consumption and speed points of view.

@bergwolf I don't think we can convince each other but adding a configurable delay (e.g. 10ms but it can be set to 0ms if you want to busy wait) to the connect(2) loop is a workable compromise.
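
A sketch of that compromise (connectOnce stands in for a single short-timeout dial attempt such as the snippet later in this thread; retryDelay comes from configuration and may be 0 for pure busy waiting):

        // Retry the short-timeout connect until it succeeds or the deadline passes.
        func connectWithRetry(cid, port uint32, retryDelay time.Duration, deadline time.Time) (int, error) {
                for {
                        fd, err := connectOnce(cid, port) // hypothetical short-timeout dial
                        if err == nil {
                                return fd, nil
                        }
                        if time.Now().After(deadline) {
                                return -1, fmt.Errorf("vsock connect timed out: %w", err)
                        }
                        time.Sleep(retryDelay) // retryDelay == 0 degenerates to a busy loop
                }
        }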

I think @justin-he's kernel patch already does what you are looking for? It is available here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b841030625cde5f784dd62aec72d6a766faae70

@stefanha yeah, a configurable delay sounds good to me. Thanks!

Hi~~ Glad we all agree to use a configurable delay. ;)

As for host kernels that don't have justin's newly merged kernel patch, we probably need to shorten the timeout and, of course, add a configurable delay.
For example, we could shorten the connect timeout to 10ms and, on failure, retry after another 10ms delay.
But here is the thing: mdlayher's vsock library doesn't provide an API to dial with a timeout; it only provides Dial with the default 2s.
We need to add something like this to provide dialing with a configurable timeout:

        fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM, 0)
        if err != nil {
                return err
        }
        vsa := &unix.SockaddrVM{
                CID:  cid,
                Port: port,
        }
        // Override the kernel's 2s default connect timeout for this socket.
        tv := unix.NsecToTimeval(1e6 * vsockConnectTimeoutMs)
        if err := unix.SetsockoptTimeval(fd, unix.SOL_SOCKET, unix.SO_VM_SOCKETS_CONNECT_TIMEOUT, &tv); err != nil {
                return err
        }
        if err := unix.Connect(fd, vsa); err != nil {
                return err
        }

And AFAICT, we can't avoid using mdlayher's vsock library; it provides the net.Conn implementation for the vsock protocol.

Hi @BetaXOi Seems you have it implemented in your private commits BetaXOi@6cd6780; maybe you'd like to also enable it in https://github.com/mdlayher/vsock/ ? 😊

Just an update: the 0b841030625c ("vhost: vsock: kick send_pkt worker once device is started") Linux patch is queued up for the stable branches (4.9, 4.14, 4.19, 5.4, 5.6), so I hope the next stable releases will contain the fix.