microsoft/go-winio

DialPipe problem with multiple calls / waiting for busy pipe

Closed this issue · 8 comments

Currently there is some work in progress (portainer/portainer#1186) to add mapped named pipe support into Portainer so it will be easier to run it in a Windows container ( docker run -u ContainerAdministrator -it -v //./pipe/docker_engine://./pipe/docker_engine -p 9000:9000 stefanscherer/portainer:insider -H npipe:////./pipe/docker_engine )

We're using winio.DialPipe() to connect to the Docker engine from inside a microsoft/nanoserver-insider container.

At the moment we see problems in Portainer as it sometimes makes multiple requests requested by the web frontend. Here's a WireShark screenshot with the problem.

bildschirmfoto 2017-09-17 um 07 42 23

The two /api/endpoints/1/docker/xx GET requests use separate connections and run in parallel. One of the calls aborts with "open //./pipe/docker_engine: The system cannot find the file specified."

I have seen the check for cERROR_PIPE_BUSY and first thought that createFile() returns cERROR_FILE_NOT_FOUND and aborts immediatelly.

We are using the default timeout right now

    return winio.DialPipe(namedPipePath, nil)

but I also tried giving a 5 second timeout

    timeout := 5000 * time.Millisecond
    return winio.DialPipe(namedPipePath, &timeout)

Digging deeper it seems like the waitNamedPipe() just does not wait long enough or cannot start wait for the pipe.

Adding some Println in DialPipe I now see this running Portainer in the second DialPipe call:

DialPipe createFile err All pipe instances are busy.
DialPipe waitNamedPipe ms 4998
DialPipe waitNamedPipe returns err The system cannot find the file specified.
DialPipe returns err The system cannot find the file specified.

The problem is that waitNamedPipe returns immediatelly and does not wait the given 5000 milliseconds.

How can we make DialPipe more reliable?

It seems the problem only appears inside a container.

I stripped a test down down to this (with some debug Println in go-winio):

package main

import (
	"fmt"
	"time"

	"github.com/StefanScherer/go-winio"
)

func main() {
	namedPipePath := "\\\\.\\pipe\\docker_engine"
	timeout := 5000 * time.Millisecond
	conn1, err1 := winio.DialPipe(namedPipePath, &timeout)
	fmt.Println("err1", err1)
	conn2, err2 := winio.DialPipe(namedPipePath, &timeout)
	fmt.Println("err2", err2)
	if conn1 != nil {
		conn1.Close()
	}
	if conn2 != nil {
		conn2.Close()
	}
}

And running the container I very often see the problem opening the pipe the second time:

$ docker run -u ContainerAdministrator -v //./pipe/docker_engine://./pipe/docker_engine stefanscherer/testpipe:insider
DialPipe createFile err <nil>
err1 <nil>
DialPipe createFile err All pipe instances are busy.
DialPipe waitNamedPipe ms 5000
DialPipe waitNamedPipe returns err The system cannot find the file specified.
DialPipe returns err The system cannot find the file specified.
err2 open \\.\pipe\docker_engine: The system cannot find the file specified.

Only a few times I got

$ docker run -u ContainerAdministrator -v //./pipe/docker_engine://./pipe/docker_engine stefanscherer/testpipe:insider
DialPipe createFile err <nil>
err1 <nil>
DialPipe createFile err <nil>
err2 <nil>

Running the testpipe.exe on the Windows Insider Docker host seems to work every time.

Any update on this?

@StefanScherer @friism do we know when the named pipe mount feature will be released?

@deviantony I will try to get more info at DockerCon today about this.

Maybe @jstarks finds a minute after DockerCon to give us advise if it's fine to merge portainer PR now if go-winio doesn't need an update itself.

Any update on this?

I just stumbled upon this issue in hashicorp/terraform-provider-docker#58.

I hit this on Windows 10 using Docker for Windows. It does not only happen when running inside a container.

I wrote a stress test for the Win32 named pipes API and managed to reproduce the FILE_NOT_FOUND error in the calls to both CreateFile() and WaitNamedPipe(). See this gist for the source code. It's Python rather than Go, but you should be able to recognize the error handling pattern.

Long story short: there is a race condition in the Win32 named pipe API where the CreateFile() and WaitNamedPipe() calls on the client can be invoked while the server has no outstanding calls on ConnectNamedPipe() (e.g. while it's dispatching a connection). The BSD-style sockets API solved this using a listen backlog, but there doesn't seem to be an equivalent in the Win32 named pipes API.

I'm pretty sure the client needs to handle the FILE_NOT_FOUND error in both calls with a short (ideally randomized) pause and a retry to stay resilient to that race condition. That's what I ended up doing in my stress test and it became very stable afterward.

I submitted a pull request with a tentative fix for this issue. The test suite in hashicorp/terraform-provider-docker#58 now passes reliably when run with a build containing this fix.