golang/go

net/http: Dial I/O Timeout even when request context is not canceled

Opened this issue · 17 comments

gyuho commented

Seems related to 869e576 (c.f. kubernetes-sigs/aws-encryption-provider#61).

Maybe this is an expected behavior... Is it possible to see dial timeouts even when the request context has not been canceled? I am using https://golang.org/pkg/net/http/#RoundTripper with dial timeout 30 seconds and request context is 300µs. If this is possible, can you help understand why this happens?

What version of Go are you using (go version)?

$ go version

go version go1.13.6 darwin/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/leegyuho/Library/Caches/go-build"
GOENV="/Users/leegyuho/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB="
"
GOOS="darwin"
GOPATH="/Users/leegyuho/go"
GOPRIVATE="*"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/y_/_dn293xd5kn7xlg6jvp7jpmxs99pm9/T/go-build128444364=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

func main() {
	// this happens about 50% of the time, so try it a few times.
	for i := 0; i < 10; i++ {
		fmt.Println("trying", i)
		try()
	}
}

func try() {
	mux := http.NewServeMux()
	mux.HandleFunc("/test", func(w http.ResponseWriter, req *http.Request) {
		switch req.Method {
		case "GET":
			fmt.Fprint(w, `test`)
		default:
			http.Error(w, "Method Not Allowed", 405)
		}
	})

	ts := httptest.NewServer(mux)
	defer ts.Close()

	u := ts.URL + "/test"

        // default dial timeout is 30-second
        // https://golang.org/pkg/net/http/#RoundTripper
	cli := http.DefaultClient

	timeout := 300 * time.Microsecond
	reqs := 20
	errc := make(chan error, reqs)
	for i := 0; i < reqs; i++ {
		go func() {
			ctx, cancel := context.WithTimeout(context.TODO(), timeout)
			defer cancel()

			req, err := http.NewRequest(http.MethodGet, u, nil)
			if err != nil {
				errc <- err
				return
			}
			_, err = cli.Do(req.WithContext(ctx))
			if err != nil {
				// can be: ctx.Err() == nil && err == "i/o timeout"
				// Q. but how is that possible?
				fmt.Println("Do failed with", err, "/ context error:", ctx.Err())
			}
			errc <- err
		}()
	}

	// "context deadline exceeded" for requests after timeout
	exp := `context deadline`
	for i := 0; i < reqs; i++ {
		err := <-errc
		if err == nil {
			continue
		}
		fmt.Println("error:", err)
		if !strings.Contains(err.Error(), exp) {
			panic(fmt.Sprintf("#%d: expected %q, got %v", i, exp, err))
		}
	}
}

What did you expect to see?

No panic.

What did you see instead?

trying 0
Do failed with Get http://127.0.0.1:54098/test: dial tcp 127.0.0.1:54098: i/o timeout / context error: <nil>
error: Get http://127.0.0.1:54098/test: dial tcp 127.0.0.1:54098: i/o timeout
Do failed with Get http://127.0.0.1:54098/test: dial tcp 127.0.0.1:54098: i/o timeout / context error: <nil>
Do failed with Get http://127.0.0.1:54098/test: dial tcp 127.0.0.1:54098: i/o timeout / context error: <nil>
Do failed with Get http://127.0.0.1:54098/test: dial tcp 127.0.0.1:54098: i/o timeout / context error: <nil>
Do failed with Get http://127.0.0.1:54098/test: dial tcp 127.0.0.1:54098: i/o timeout / context error: <nil>
Do failed with Get http://127.0.0.1:54098/test: dial tcp 127.0.0.1:54098: i/o timeout / context error: <nil>
Do failed with Get http://127.0.0.1:54098/test: dial tcp 127.0.0.1:54098: i/o timeout / context error: <nil>
Do failed with Get http://127.0.0.1:54098/test: dial tcp 127.0.0.1:54098: i/o timeout / context error: <nil>
Do failed with Get http://127.0.0.1:54098/test: dial tcp 127.0.0.1:54098: i/o timeout / context error: <nil>
Do failed with Get http://127.0.0.1:54098/test: context deadline exceeded / context error: context deadline exceeded
Do failed with Get http://127.0.0.1:54098/test: context deadline exceeded / context error: context deadline exceeded
Do failed with Get http://127.0.0.1:54098/test: context deadline exceeded / context error: context deadline exceeded
Do failed with Get http://127.0.0.1:54098/test: dial tcp 127.0.0.1:54098: i/o timeout / context error: <nil>
panic: #0: expected "context deadline", got Get http://127.0.0.1:54098/test: dial tcp 127.0.0.1:54098: i/o timeout

Hello @gyuho, thank you for the question and welcome to the Go project!

So the context is meant to control the lifetime of a request and any operations that may need to be controlled by a user. The external view of applications that use a context is to end their operations when context.Done() receives a result and then the corresponding error is checked with context.Err(). Therefore by passing in a timeout of 300us, your code is asking consumers of the context to not process anything after that very short deadline, thus your dial I/O timeout.

We document that the context controls the lifetime of the request as per:

I'd highly recommend taking a look at the context blog post https://blog.golang.org/context as well.
Please let me know if this makes sense.

Thank you.

gyuho commented

@odeke-em

When context has not been canceled "yet" AND the dial i/o timeout in http.DefaultClient is greater than 300µs (which is 30sec), I would not expect the client to return dial timeout errors. The only expected error here should be "context deadline exceeded".

I am using the context as:

cli.Do(req.WithContext(ctx))

I see this context being used in http.(*Transport).getConn, and then in DialContext via http.(*Transport).dial. And I would like to know how DialContext would return i/o timeout when dial timeout has not been elapsed and context has not been canceled.

When context has not been canceled "yet" AND the dial i/o timeout in http.DefaultClient is greater than 300µs (which is 30sec), I would not expect the client to return dial timeout errors.
And I would like to know how DialContext would return i/o timeout when dial timeout has not been elapsed and context has not been canceled.

300 microseconds is a very short time and when dialling even local connections this can timeout at times, (I've seen some syscalls in the magnitude microseconds and local RPCs on very fast local networks taking about 100 microseconds). That context is being passed directly to net.Dialer.DialContext when using http.DefaultClient.

Here is a repro that'll return the I/O timeout and performs a dial directly to a plainly written server that just spits out HTTP/1.1

package main

import (
	"bufio"
	"context"
	"log"
	"net"
	"net/http"
	"net/http/httputil"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", ":0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				log.Fatalf("Failed to accept connection: %v", err)
			}
			conn.Write([]byte("HTTP/1.1 200 OK\r\nHost: foo.golang.org\r\nConnection: keep-alive\r\nContent-Length: 10\r\n\r\nabcdefghij\r\n"))

		}
	}()

	var dialer net.Dialer

	for {
		func() {
			ctx, cancel := context.WithTimeout(context.Background(), 300*time.Microsecond)
			defer cancel()

			conn, err := dialer.DialContext(ctx, "tcp", ln.Addr().String())
			if err != nil {
				log.Fatalf("Failed to read response: %v\nErr isTimeout? %t", err, err.(net.Error).Timeout())
			}
			defer conn.Close()
			br := bufio.NewReader(conn)
			res, err := http.ReadResponse(br, nil)
			if err != nil {
				log.Fatalf("Failed to read response: %v", err)
			}
			blob, _ := httputil.DumpResponse(res, true)
			_ = res.Body.Close()
			println(string(blob) + "\n\n")
		}()

		<-time.After(800 * time.Millisecond)
	}
}

and when run will show you the I/O timeout on the second request but not the first

$ go run dial.go 
HTTP/1.1 200 OK
Content-Length: 10
Connection: keep-alive
Host: foo.golang.org

abcdefghij


2020/01/30 10:43:17 Failed to read response: dial tcp [::]:54107: i/o timeout
Err isTimeout? true
exit status 1

I am currently on a long haul flight but when I get home in about 12 hours, I'll take a look at your response, but really perhaps please examine the timeouts that you are passing into the context which controls the lifetime of the request, dials as well as responses.

gyuho commented

@odeke-em I am aware 300µs is very short, even for local test server. Your code doesn't reproduce my issue.

My confusion is:

  1. whether 300µs timeout context is being used for http client dial call (it seems so)
  2. if that's the case, how is it possible to receive dial tcp i/o timeout even when the context has not been canceled (see where my code panics above)
gyuho commented

Btw, thanks for helping on this!

@gyuho you may get more/quicker help by posting to golang-nuts.

  1. whether 300µs timeout context is being used for http client dial call (it seems so)

Yes, it is being passed in for the dial and is documented as controlling the lifetime of the request. If the dial happens fast enough, you'll get a result back, otherwise you'll get an I/O timeout, which I tried to illustrate with the net dialling snippet above, but that happens non-deterministically.

  1. if that's the case, how is it possible to receive dial tcp i/o timeout even when the context has not been canceled (see where my code panics above)

For a context, as I had mentioned there are 2 ways that it can be rendered moot: 1. If the deadline is exceeded, 2. If cancel is invoked. In your case the deadline is what controls how long the operation is expected to be alive for, and in that short duration the dial wasn't completed hence the "dial TCP I/O timeout", when passed into the net.Dialer as per

That context is being passed directly to net.Dialer.DialContext when using http.DefaultClient.

@gyuho you may get more/quicker help by posting to golang-nuts.

@networkimprov while this is true, it rubs off wrong and sounds a bit dismissive of others' efforts. Please try to assume best/good intent and also try to be more productive when replying to answers, as we are working on providing clarity to their answer, with my own time.

Btw, thanks for helping on this!

Glad to be of help as much as I can :)

Thanks!

@odeke-em I think you misread my comment. This appears to be a question, not a bug report, and usually those are referred to the forums so as not to overburden folks like you!

Sorry for jumping on this one here but it seems to me this could highlight some kind of inconsistency.

I have a similar issue when performing multiple GET request, each from a separate go routine, each with a context derived from a parent context with a deadline of 2.5 seconds. If the machine is put under heavy load, some of the requests start failing with a deadline exceeded error, and some with i/o timeout which I find misleading since the Timeout set for the dial is 30 seconds (i am using http.DefaultClient and http.DefaultTransport).

I think it would be more consistent to have a deadline exceeded error in this case since the Dial was interrupted because of the deadline set instead of a timeout on a dial as the error would suggest. Does this make sense?

(Sorry if it doesn't, first time commenting on an issue, please be kind :D)

In our case this made the difference between investigating a network issue and a performance issue.

(Edited to add comment about impact on investigation)

@giannimassi could you post a runnable code example for the scenario you describe?

@networkimprov Yes, I should've done it in the first place, thanks for asking.
The following code prints errors that are not deadline exceeded for each request done concurrently as explained in the previous comment.

package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"strconv"
	"strings"
	"sync"
	"time"
)

const (
	n       = 1000
	timeout = 10000 * time.Microsecond
)

func main() {
	ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(timeout))
	defer cancel()
	doConcurrentRequests(ctx, n)
}

func doConcurrentRequests(ctx context.Context, n int) {
	// Explicitly configure Dial Timeout to 30 seconds
	// (this is not necessary since the default client already has this configuration)
	transport := http.DefaultTransport
	transport.(*http.Transport).DialContext = (&net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
		DualStack: true,
	}).DialContext
	client := http.DefaultClient
	client.Transport = transport

	wg := sync.WaitGroup{}
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			now := time.Now()
			err := doRequest(client, ctx)
			if err != nil && !strings.Contains(err.Error(), "deadline") {
				fmt.Println(strconv.Itoa(id) + " - Request took:" + time.Since(now).String() + " " + err.Error())
			}
		}(i)
	}
	wg.Wait()
}

func doRequest(client *http.Client, ctx context.Context) error {
	req, err := http.NewRequest("GET", "http://example.com", nil)
	if err != nil {
		return fmt.Errorf("while making new request: %w", err)
	}

	_, err = client.Do(req.WithContext(ctx))
	if err != nil {
		return fmt.Errorf("while executing request: %w", err)
	}
	return nil
}

(Btw I know this is far from efficient code 😄 )

arl commented

@giannimassi trying your script on go1.13.4 linux/amd64 with n=1000 and timeout=10s doesn't produce the unexpected error message.
If I increase n, the only error is EMFILE (i.e too many opened files).
So this might be platform-dependent

@arl makes sense. Might be some syscall return value interpretation issue. It could also be the fact that the behaviour requires different parameters, with the same parameters the behaviour changes a lot on my machine.

For the record this is an example output with go version go1.13.8 darwin/amd64:

706 - Request took:8.420581ms while executing request: Get http://example.com: dial tcp [2606:2800:220:1:248:1893:25c8:1946]:80: i/o timeout
76 - Request took:8.604937ms while executing request: Get http://example.com: dial tcp [2606:2800:220:1:248:1893:25c8:1946]:80: i/o timeout

I'm hoping to find some time to investigate this a bit more this weekend, and maybe make my first attempt at a fix in the standard library, although it doesn't look like the simplest piece of code (hints/help/feedback appreciated @networkimprov @cagedmantis).

jonaz commented

Can this be selinux related? I see "i/o timeout" instead of expected context deadline when running on selinux enabled centos 8. On my local machine i get deadline exceeded.

any update?