502 when under load
Closed this issue · 5 comments
When under load, nginx sporadically returns 502 responses
Repro
(Linux, docker driver, minikube w/ ingress addon)
substra login
for i in `seq 200`; do
substra get traintuple $i &
done
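The same burst can be reproduced in a self-contained way with a stub HTTP server (a sketch: the stub, port, and the 400 response merely imitate the backend's behaviour in this test; without nginx in front, no 502s occur):

```python
import collections
import http.server
import threading
import urllib.error
import urllib.request

# Stub server standing in for the backend (hypothetical: the real repro goes
# through nginx to substra-backend, which answers 400 for these requests).
class StubHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(400)
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

class StubServer(http.server.ThreadingHTTPServer):
    request_queue_size = 256  # large listen backlog so the burst isn't refused

server = StubServer(('127.0.0.1', 0), StubHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f'http://127.0.0.1:{server.server_address[1]}'

counts = collections.Counter()
lock = threading.Lock()

def fetch(i):
    try:
        with urllib.request.urlopen(f'{base}/traintuple/{i}/', timeout=10) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code
    except OSError:
        status = 502  # count connection resets the way nginx reports them
    with lock:
        counts[status] += 1

threads = [threading.Thread(target=fetch, args=(i,)) for i in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()
server.shutdown()

print(dict(counts))
```

Against the stub this prints only 400s; run through the ingress, the bug shows up as a handful of 502 entries in the counter.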
In the logs
2020/07/13 17:44:56 [error] 2217#2217: *672834 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.17.0.1, server: substra-backend.node-1.com, request: "GET /traintuple/184/ HTTP/1.1", upstream: "http://172.18.0.47:8000/traintuple/184/", host: "substra-backend.node-1.com"
[...]
172.17.0.1 - - [13/Jul/2020:17:44:56 +0000] "GET /traintuple/184/ HTTP/1.1" 400 26 "-" "python-requests/2.24.0" 260 0.027 [org-1-backend-org-1-substra-backend-server-http] [] 172.18.0.47:8000, 172.18.0.47:8000 0, 26 0.000, 0.024 502, 400 be8dae30c7f82749a8b130ddf459875f
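The upstream status codes at the end of the second log line can be pulled out programmatically; a sketch assuming ingress-nginx's default log format, where the fields after the final "[...]" block are upstream_addr, upstream_response_length, upstream_response_time, upstream_status, and the request ID:

```python
def upstream_statuses(line: str) -> list:
    """Extract upstream status codes from an ingress-nginx access-log line.

    Assumes the default ingress-nginx log format. With one retry, each
    upstream field holds two comma-separated values, so upstream_status
    sits in the two tokens just before the trailing request ID.
    """
    tail = line.rsplit(']', 1)[1].split()
    return [token.rstrip(',') for token in tail[-3:-1]]


log_line = (
    '172.17.0.1 - - [13/Jul/2020:17:44:56 +0000] "GET /traintuple/184/ HTTP/1.1" '
    '400 26 "-" "python-requests/2.24.0" 260 0.027 '
    '[org-1-backend-org-1-substra-backend-server-http] [] '
    '172.18.0.47:8000, 172.18.0.47:8000 0, 26 0.000, 0.024 502, 400 '
    'be8dae30c7f82749a8b130ddf459875f'
)
print(upstream_statuses(log_line))  # -> ['502', '400']
```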
For each batch of 200 concurrent requests, I consistently get 1-4 "502" responses.
Interestingly, the first time I run the test, I only get one 502. When I re-run the test, I get 3-4 502s, and I keep getting 3-4 502s in subsequent runs. This might be related to the fact that we currently use the cheaper worker algorithm.
Nginx retry
Note that in the example above, nginx tries hitting the backend twice. See the end of the second log line:
502, 400 be8dae30c7f82749a8b130ddf459875f
Explanation:
- nginx tries to hit the backend, but the connection gets interrupted => 502
- nginx tries to hit the backend again; this time the call succeeds => 400 (this is the correct, expected return code for this test)
I have no explanation as to why the request is retried. My understanding is that this shouldn't happen, since there's only one backend server configured (confirmed with kubectl ingress-nginx backends -n kube-system).
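For reference, in plain nginx this retry behaviour is governed by the proxy_next_upstream family of directives. A minimal sketch with illustrative values (ingress-nginx generates its own configuration, so this is not the deployed config; the upstream name is a placeholder):

```nginx
location / {
    proxy_pass http://substra_backend;  # placeholder upstream name

    # Retry the next (or, with a single server, the same) upstream on
    # connection errors, timeouts and 502s:
    proxy_next_upstream error timeout http_502;
    proxy_next_upstream_tries 3;  # up to 3 attempts in total
}
```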
I sometimes get "Bad Gateway" (502) errors when running the substra-tests suite locally; it does not always happen on the same test, I think.
My installation of Substra is with skaffold (see the installation instructions)
My setup is:
- macOS Catalina 10.15.5
- docker desktop community, 2.3.0.3 with kubernetes 1.16.5
This might be related to the fact that we currently use the cheaper worker algorithm.
Not sure; I added the cheaper workers to solve an issue I was facing before adding them :(
I have no explanation as to why the request is retried. My understanding is that this shouldn't happen since there's only one backend server configured (confirmed with kubectl ingress-nginx backends -n kube-system)
This is a feature of nginx: it retries after receiving certain 5xx error codes (doc: nginx#proxy_next_upstream), as long as the request is not a POST, PUT, or anything else that is not idempotent.
If you have only one pod and proxy-next-upstream-tries is set to 3 (the default), it will try the same server three times (this is what I understood from this thread: kubernetes/ingress-nginx#4944). This could explain why you see a retry even though you only have one server running.
Here, in the default nginx-ingress code: https://github.com/kubernetes/ingress-nginx/blob/master/internal/ingress/controller/config/config.go#L784-L786, you can see it retries only on timeout.
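If the retry is undesirable, ingress-nginx exposes these knobs as per-Ingress annotations; a sketch (values are illustrative, see the ingress-nginx annotations reference):

```yaml
metadata:
  annotations:
    # don't fail over to a "next" upstream attempt on error/502
    nginx.ingress.kubernetes.io/proxy-next-upstream: "off"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "1"
```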
Closing stale issue.