Failed requests with Osiris proxy
vbehar opened this issue · 3 comments
Environment:
- Kubernetes distribution (e.g. AKS, ACS Engine, GKE, EKS, etc.): GKE & docker for mac
- Kubernetes version (use kubectl version):
for GKE:
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T12:38:00Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.10-gke.0", GitCommit:"569511c9540f78a94cc6a41d895c382d0946c11a", GitTreeState:"clean", BuildDate:"2019-08-21T23:28:44Z", GoVersion:"go1.11.13b4", Compiler:"gc", Platform:"linux/amd64"}
for docker-for-mac:
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T12:38:00Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.6", GitCommit:"96fac5cd13a5dc064f7d9f4f23030a6aeface6cc", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:16Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
- Osiris version (specify by SHA or semver): bb78180 & 472cbec
- Install method (specify exact helm install command used):
helm install osiris/osiris-edge --name osiris --devel
What happened?
We used to run on GKE with an old version of Osiris - bb78180, from before the http2 PR. Everything worked nicely for months.
We recently upgraded to a more recent commit - 472cbec, from after the http2 PR - to benefit from recent changes. Now we see failed requests from time to time. Rolling back to the previous version fixed the issue.
What you expected to happen?
No failed requests.
How to reproduce it (as minimally and precisely as possible):
on docker-for-mac:
$ helm install osiris/osiris-edge --name osiris --devel
$ kubectl apply -f example/hello-osiris.yaml
$ ab -n 1000 -c 5 http://localhost:8080/
and then roll back to an older version:
$ helm upgrade --reset-values osiris osiris/osiris-edge --version=0.0.1-2018.12.21.23.19.56-bb78180
$ kubectl scale deployment hello-osiris --replicas 0
$ kubectl scale deployment hello-osiris --replicas 1
$ ab -n 1000 -c 5 http://localhost:8080/
Anything else that we need to know?
Output of the ab run with commit 472cbec (broken version):
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: localhost
Server Port: 8080
Document Path: /
Document Length: 13 bytes
Concurrency Level: 5
Time taken for tests: 0.849 seconds
Complete requests: 1000
Failed requests: 20
(Connect: 0, Receive: 0, Length: 20, Exceptions: 0)
Total transferred: 127400 bytes
HTML transferred: 12740 bytes
Requests per second: 1177.77 [#/sec] (mean)
Time per request: 4.245 [ms] (mean)
Time per request: 0.849 [ms] (mean, across all concurrent requests)
Transfer rate: 146.53 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 1 4 1.6 4 14
Waiting: 0 4 1.6 3 14
Total: 2 4 1.6 4 14
Percentage of the requests served within a certain time (ms)
50% 4
66% 4
75% 5
80% 5
90% 6
95% 7
98% 9
99% 10
100% 14 (longest request)
Notice the 20 failed requests.
Logs of the osiris-proxy container on the hello-osiris pod:
I1014 07:22:31.007410 1 proxy.go:12] Starting Osiris Proxy -- version devel -- commit 472cbec
I1014 07:22:31.007683 1 proxy.go:115] Healthz and metrics server is listening on :5004
Logs of the hello-osiris container:
2019/10/14 07:22:30 Listening for HTTP/1.x without TLS on :8080
2019/10/14 07:22:30 Listening for h2c (HTTP/2 without TLS) on :8081
2019/10/14 07:22:30 Listening for HTTPS (HTTP/1.x OR HTTP/2 with TLS) on :4430
2019/10/14 07:22:30 Listening for insecure gRPC (no TLS) on :8082
2019/10/14 07:22:30 Note: Due to limitations of SNI, Osiris only supports one TLS-enabled port per application, so this example does not demonstrate gRPC with TLS, although this combination should work.
2019/10/14 07:22:42 Received: GET / HTTP/1.1 (without TLS)
...
$ kubectl logs hello-osiris-6fdc75d555-l8dxj hello-osiris | grep "Received: GET / HTTP/1.1 (without TLS)" | wc -l
1000
Output of the proxy metrics on :5004/metrics:
{
"proxyId": "9ac00e99-ec38-4fda-8ea2-f41a26a602d6",
"connectionsOpened": 1000,
"connectionsClosed": 1000
}
Output of the ab run with commit bb78180 (good old version):
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: localhost
Server Port: 8080
Document Path: /
Document Length: 13 bytes
Concurrency Level: 5
Time taken for tests: 0.881 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 130000 bytes
HTML transferred: 13000 bytes
Requests per second: 1134.97 [#/sec] (mean)
Time per request: 4.405 [ms] (mean)
Time per request: 0.881 [ms] (mean, across all concurrent requests)
Transfer rate: 144.09 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 2
Processing: 2 4 2.0 4 22
Waiting: 1 4 1.8 3 20
Total: 2 4 2.0 4 22
Percentage of the requests served within a certain time (ms)
50% 4
66% 5
75% 5
80% 5
90% 6
95% 8
98% 10
99% 12
100% 22 (longest request)
I'll try to investigate further, but I wanted to file a bug report first; maybe you'll have an idea of the cause.
By the way, is there a reason for still using Go 1.11? I saw in the source code that there is a copy of the httputil package from Go 1.12.
What I found so far: when I hit the issue with curl, I get:
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 8080 (#0)
GET / HTTP/1.1
Host: localhost:8080
User-Agent: curl/7.54.0
Accept: application/json, */*
* Empty reply from server
* Connection #0 to host localhost left intact
curl: (52) Empty reply from server
And if I add, for example, a time.Sleep(10*time.Second) in the singleConnectionListener.Accept func before the 2nd call to Accept returns, the test with 1000 requests using ab works: 0 failed requests:
func (s *singleConnectionListener) Accept() (net.Conn, error) {
	var c net.Conn
	s.once.Do(func() {
		c = s.conn
	})
	if c != nil {
		return c, nil
	}
	time.Sleep(10 * time.Second) // wait a little before returning
	return nil, io.EOF
}
I'll need to understand why waiting before returning from the Accept func fixes the issue...
Not closing the connection in dynamicProxy.ListenAndServe also fixes the issue. I found out that closing the connection returns an ErrNetClosing error. And if I don't close it, it seems to get closed by something else, because it doesn't hold on to any resources (I integrated Prometheus in the proxy to make sure that we don't leak goroutines or file descriptors by not closing the conn).