Substra/substra-backend

502 when under load

Closed this issue · 5 comments

When under load, nginx sporadically returns 502 responses

Repro

(Linux, docker driver, minikube w/ ingress addon)

substra login

for i in $(seq 200); do
    substra get traintuple $i &
done

In the logs

2020/07/13 17:44:56 [error] 2217#2217: *672834 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.17.0.1, server: substra-backend.node-1.com, request: "GET /traintuple/184/ HTTP/1.1", upstream: "http://172.18.0.47:8000/traintuple/184/", host: "substra-backend.node-1.com"
[...]
172.17.0.1 - - [13/Jul/2020:17:44:56 +0000] "GET /traintuple/184/ HTTP/1.1" 400 26 "-" "python-requests/2.24.0" 260 0.027 [org-1-backend-org-1-substra-backend-server-http] [] 172.18.0.47:8000, 172.18.0.47:8000 0, 26 0.000, 0.024 502, 400 be8dae30c7f82749a8b130ddf459875f

Across these 200 concurrent requests, I consistently get 1-4 "502" responses.

Interestingly, the first time I run the test, I only get one 502. When I re-run the test, I get 3-4 502s. I then keep on getting 3-4 502s in subsequent tests. This might be related to the fact that we currently use the cheaper algorithm.

Nginx retry

Note that in the example above, nginx tries hitting the backend twice. See the end of the second log line:

502, 400 be8dae30c7f82749a8b130ddf459875f

Explanation:

  • nginx tries to hit the backend, but the connection gets interrupted => 502
  • nginx retries the request; this time the call succeeds => 400 (this is the correct, expected return code for this test)

I have no explanation as to why the request is retried. My understanding is that this shouldn't happen since there's only one backend server configured (confirmed with kubectl ingress-nginx backends -n kube-system)

I sometimes get "Bad Gateway" (502) errors when running the tests in substra-tests locally; I don't think it always happens on the same test.

I installed Substra with skaffold (see the installation instructions)

My setup is:

  • macOS Catalina 10.15.5
  • Docker Desktop Community 2.3.0.3, with Kubernetes 1.16.5

This might be related to the fact that we currently use the cheaper algorithm.

Not sure; I added the cheaper workers to solve an issue I was facing before adding them :(
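For context, "cheaper" here refers to uWSGI's cheaper subsystem, which dynamically scales the number of workers. A minimal sketch of such a configuration (the option names are real uWSGI settings, but the values are hypothetical, not the actual substra-backend ones):

```ini
[uwsgi]
workers = 8              ; hard maximum number of workers
cheaper = 2              ; never scale below 2 workers
cheaper-initial = 4      ; number of workers spawned at startup
cheaper-algo = busyness  ; scaling algorithm (requires the cheaper_busyness plugin)
```

When load drops, uWSGI reaps surplus workers; if nginx still has a connection open to a worker that gets stopped, that could plausibly surface as the "connection reset by peer" errors in the logs above.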

I have no explanation as to why the request is retried. My understanding is that this shouldn't happen since there's only one backend server configured (confirmed with kubectl ingress-nginx backends -n kube-system)

This is a feature of nginx: it retries after receiving certain 5xx error codes (doc: nginx#proxy_next_upstream), as long as the request is not a POST, PUT, or another non-idempotent method.
If you have only one pod and proxy-next-upstream-tries is set to 3 (the default), it will try the same server up to three times (this is what I understood from this thread: kubernetes/ingress-nginx#4944); this could explain why you see a retry even though you only have one server running.
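In nginx terms, the generated location block behaves roughly like this sketch (the exact directives emitted by ingress-nginx may differ):

```nginx
location / {
    proxy_pass http://upstream_balancer;
    # on a connection error or timeout, pass the request to the "next"
    # upstream, which with a single pod is the same server again
    proxy_next_upstream error timeout;
    proxy_next_upstream_tries 3;
}
```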
Here, in the default nginx-ingress code: https://github.com/kubernetes/ingress-nginx/blob/master/internal/ingress/controller/config/config.go#L784-L786, you can see it retries only on error and timeout.
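If the retries are unwanted, they can be tuned through the ingress-nginx ConfigMap; a sketch (the ConfigMap name and namespace depend on how the controller was installed):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration     # hypothetical; match your controller's ConfigMap
  namespace: kube-system
data:
  proxy-next-upstream: "off"    # never retry on another (or the same) upstream
  proxy-next-upstream-tries: "1"
```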

Closing stale issue.