deislabs/osiris

Failed requests with Osiris proxy

vbehar opened this issue · 3 comments

Environment:

  • Kubernetes distribution (e.g. AKS, ACS Engine, GKE, EKS, etc.): GKE & Docker for Mac
  • Kubernetes version (use kubectl version):

for GKE:

Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T12:38:00Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.10-gke.0", GitCommit:"569511c9540f78a94cc6a41d895c382d0946c11a", GitTreeState:"clean", BuildDate:"2019-08-21T23:28:44Z", GoVersion:"go1.11.13b4", Compiler:"gc", Platform:"linux/amd64"}

for docker-for-mac:

Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T12:38:00Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.6", GitCommit:"96fac5cd13a5dc064f7d9f4f23030a6aeface6cc", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:16Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Osiris version (specify by SHA or semver): bb78180 & 472cbec
  • Install method (specify exact helm install command used):
helm install osiris/osiris-edge --name osiris --devel

What happened?

We used to run on GKE with an old version of Osiris - bb78180, before the http2 PR - and everything worked nicely for months.
We recently upgraded to a more recent commit - 472cbec, after the http2 PR - to benefit from the recent changes, and now we get failed requests from time to time. Rolling back to the previous version fixed the issue.

What did you expect to happen?

No failed requests.

How to reproduce it (as minimally and precisely as possible):

on docker-for-mac:

$ helm install osiris/osiris-edge --name osiris --devel
$ kubectl apply -f example/hello-osiris.yaml
$ ab -n 1000 -c 5 http://localhost:8080/

and then roll back to the older version:

$ helm upgrade --reset-values osiris osiris/osiris-edge --version=0.0.1-2018.12.21.23.19.56-bb78180
$ kubectl scale deployment hello-osiris --replicas 0
$ kubectl scale deployment hello-osiris --replicas 1
$ ab -n 1000 -c 5 http://localhost:8080/

Anything else that we need to know?

Output of the ab run with commit 472cbec (broken version):

This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:
Server Hostname:        localhost
Server Port:            8080

Document Path:          /
Document Length:        13 bytes

Concurrency Level:      5
Time taken for tests:   0.849 seconds
Complete requests:      1000
Failed requests:        20
   (Connect: 0, Receive: 0, Length: 20, Exceptions: 0)
Total transferred:      127400 bytes
HTML transferred:       12740 bytes
Requests per second:    1177.77 [#/sec] (mean)
Time per request:       4.245 [ms] (mean)
Time per request:       0.849 [ms] (mean, across all concurrent requests)
Transfer rate:          146.53 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     1    4   1.6      4      14
Waiting:        0    4   1.6      3      14
Total:          2    4   1.6      4      14

Percentage of the requests served within a certain time (ms)
  50%      4
  66%      4
  75%      5
  80%      5
  90%      6
  95%      7
  98%      9
  99%     10
 100%     14 (longest request)

Notice the 20 failed requests.

Logs of the osiris-proxy container on the hello-osiris pod:

I1014 07:22:31.007410       1 proxy.go:12] Starting Osiris Proxy -- version devel -- commit 472cbec
I1014 07:22:31.007683       1 proxy.go:115] Healthz and metrics server is listening on :5004

Logs of the hello-osiris container:

2019/10/14 07:22:30 Listening for HTTP/1.x without TLS on :8080
2019/10/14 07:22:30 Listening for h2c (HTTP/2 without TLS) on :8081
2019/10/14 07:22:30 Listening for HTTPS (HTTP/1.x OR HTTP/2 with TLS) on :4430
2019/10/14 07:22:30 Listening for insecure gRPC (no TLS) on :8082
2019/10/14 07:22:30 Note: Due to limitations of SNI, Osiris only supports one TLS-enabled port per application, so this example does not demonstrate gRPC with TLS, although this combination should work.
2019/10/14 07:22:42 Received: GET / HTTP/1.1 (without TLS)
...
$ kubectl logs hello-osiris-6fdc75d555-l8dxj hello-osiris | grep "Received: GET / HTTP/1.1 (without TLS)" | wc -l
    1000

Output of the proxy metrics on :5004/metrics:

{
    "proxyId": "9ac00e99-ec38-4fda-8ea2-f41a26a602d6",
    "connectionsOpened": 1000,
    "connectionsClosed": 1000
}

Output of the ab run with commit bb78180 (good old version):

This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:
Server Hostname:        localhost
Server Port:            8080

Document Path:          /
Document Length:        13 bytes

Concurrency Level:      5
Time taken for tests:   0.881 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      130000 bytes
HTML transferred:       13000 bytes
Requests per second:    1134.97 [#/sec] (mean)
Time per request:       4.405 [ms] (mean)
Time per request:       0.881 [ms] (mean, across all concurrent requests)
Transfer rate:          144.09 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       2
Processing:     2    4   2.0      4      22
Waiting:        1    4   1.8      3      20
Total:          2    4   2.0      4      22

Percentage of the requests served within a certain time (ms)
  50%      4
  66%      5
  75%      5
  80%      5
  90%      6
  95%      8
  98%     10
  99%     12
 100%     22 (longest request)

I'll try to investigate more, but I wanted to file a bug report first; maybe you'll have an idea of what the issue is.

By the way, is there a reason for still using Go 1.11? I saw in the source code that there is a copy of the httputil package from Go 1.12.

What I found so far: when I hit the issue with curl, I get:

*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 8080 (#0)
> GET / HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.54.0
> Accept: application/json, */*
>
* Empty reply from server
* Connection #0 to host localhost left intact
curl: (52) Empty reply from server

And if I add a time.Sleep(10*time.Second), for example, in the singleConnectionListener.Accept func, so that the 2nd call to Accept waits before returning, then the ab test with 1000 requests passes with 0 failed requests:

func (s *singleConnectionListener) Accept() (net.Conn, error) {
	var c net.Conn
	// only the 1st call gets the connection; c stays nil on later calls
	s.once.Do(func() {
		c = s.conn
	})
	if c != nil {
		return c, nil
	}

	time.Sleep(10 * time.Second) // wait a little before returning

	// 2nd and subsequent calls: report end-of-listener
	return nil, io.EOF
}

I'll need to understand why waiting before returning from the Accept func fixes the issue...
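For context, here's what I understand the listener type to look like - a minimal sketch reconstructed from the Accept func above (the field names and the Close/Addr implementations are my guesses, not necessarily the exact Osiris source):

import (
	"net"
	"sync"
)

// A net.Listener that hands out exactly one pre-established connection,
// then reports io.EOF so the HTTP server stops accepting.
type singleConnectionListener struct {
	conn net.Conn  // the single connection returned by the 1st Accept
	once sync.Once // ensures conn is handed out exactly once
}

// Close and Addr complete the net.Listener interface; there is no real
// listening socket to close.
func (s *singleConnectionListener) Close() error   { return nil }
func (s *singleConnectionListener) Addr() net.Addr { return s.conn.LocalAddr() }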

Not closing the connection in dynamicProxy.ListenAndServe also fixes the issue. I found out that closing the connection returns an ErrNetClosing error. And if I don't close it, it seems to get closed by something else, because it doesn't hold on to any resources (I integrated Prometheus into the proxy to make sure that we don't leak goroutines or file descriptors by not closing the conn).
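To make the suspected race concrete, here's a minimal sketch of the pattern as I understand it (my reconstruction - serveSingleConn is a made-up name, not the actual dynamicProxy.ListenAndServe code). http.Server.Serve spawns a goroutine per accepted connection and immediately loops back into Accept, so as soon as the 2nd Accept returns io.EOF, Serve returns - even though the goroutine serving the 1st connection may still be writing its response. Closing the conn at that point would truncate the in-flight reply, which would explain curl's "Empty reply from server", and also why both sleeping in the 2nd Accept and not closing the conn make the failures disappear:

import (
	"io"
	"net"
	"net/http"
)

func serveSingleConn(conn net.Conn, handler http.Handler) error {
	listener := &singleConnectionListener{conn: conn}
	server := &http.Server{Handler: handler}

	// Serve accepts conn, spawns a goroutine to serve it, then calls
	// Accept again. The 2nd Accept returns io.EOF immediately, so Serve
	// returns here without waiting for the handler goroutine to finish.
	err := server.Serve(listener)

	// Racy: the handler goroutine may still be writing the response on
	// conn at this point, so closing it now can truncate the reply.
	conn.Close()

	if err == io.EOF {
		return nil
	}
	return err
}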