Proxies (NGINX and HAProxy) return 504s on the `/proto.Woodpecker/Next` gRPC route between agent and server
Component
server, agent
Describe the bug
I configured NGINX as a gRPC reverse proxy with TLS offloading between the agents and the server. However, the NGINX error log quickly fills with lines complaining about timeouts:
upstream timed out (110: Connection timed out) while reading response header from upstream, client: x.x.x.x, server: example.org, request: "POST /proto.Woodpecker/Next HTTP/2.0", upstream: "grpc://127.0.0.1:3002", host: "example.org"
This means the NGINX access log records 504s being sent back to the clients:
[02/Dec/2024:13:27:57 +0000] "POST /proto.Woodpecker/Next HTTP/2.0" 504 167 "-" "grpc-go/1.65.0"
From the debug logs of NGINX it can be determined that the timeouts happen after 60s.
Setting `WOODPECKER_KEEPALIVE_TIME=10s` on the agent to try and keep the connection open does nothing.
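For reference, 60s matches the default value of NGINX's `grpc_read_timeout`. A sketch of a possible workaround (assuming that default timeout is what fires here, and referring to the `location` block in the reverse-proxy config under "Steps to reproduce") is to raise the timeout for the proxied gRPC route:

```nginx
location / {
    grpc_pass grpc://wp;
    # Workaround sketch: allow the long-running Next() poll to stay open
    # longer than NGINX's 60s default before a 504 is returned.
    # The 1h value is an arbitrary assumption, not a recommended setting.
    grpc_read_timeout 1h;
    grpc_send_timeout 1h;
}
```

This only hides the symptom; the underlying issue of relying on a request that is expected to outlive typical proxy timeouts remains.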
Steps to reproduce
- Install the latest Woodpecker server. Let it listen for gRPC on port 3002.
- Configure an NGINX reverse proxy to perform gRPC TLS offloading as follows:
upstream wp {
    server 127.0.0.1:3002;
}
server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name example.org;
    ssl_certificate /etc/example.org/fullchain.pem;
    ssl_certificate_key /etc/example.org/privkey.pem;
    location / {
        grpc_pass grpc://wp;
    }
}
- Configure a Woodpecker agent with `WOODPECKER_SERVER=example.org` and `WOODPECKER_GRPC_SECURE=true` (a minimal sketch of the agent environment follows this list).
- Notice the Woodpecker agent can connect to the server and take tasks from the queue successfully, but there are frequent (~1 minute interval) 504s in the `access.log` and upstream timeouts in the `error.log` of NGINX.
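For completeness, the agent side of this setup is just environment configuration; a minimal sketch is below (the secret value is a placeholder, and `WOODPECKER_AGENT_SECRET` is assumed to match the server's):

```
WOODPECKER_SERVER=example.org
WOODPECKER_GRPC_SECURE=true
WOODPECKER_AGENT_SECRET=<shared-secret-placeholder>
# Added later while debugging; it did not change the behaviour (see below):
WOODPECKER_KEEPALIVE_TIME=10s
```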
Expected behavior
The agent appears to be polling for new tasks on the /proto.Woodpecker/Next
route to the server. The implementation of this form of long-polling is fragile, as an intermediary infra-component like NGINX in the path of the request can terminate connections if they are too long lived.
I would have expected that the WOODPECKER_KEEPALIVE_TIME
argument on the agent would prevent this from happening, but it does not keep the connection alive when used.
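For context, the agent's keepalive setting presumably translates into gRPC client keepalive options; in grpc-go that looks roughly like the sketch below (the dial target, durations, and the mapping to `WOODPECKER_KEEPALIVE_TIME` are assumptions for illustration, not Woodpecker's actual code):

```go
package main

import (
	"crypto/tls"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Client-side keepalive sends periodic HTTP/2 PING frames on the
	// agent->proxy connection.
	conn, err := grpc.Dial("example.org:443",
		grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping interval (illustrative value)
			Timeout:             5 * time.Second,  // how long to wait for a ping ack (illustrative)
			PermitWithoutStream: true,             // keep pinging even with no active RPC
		}),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// ... the long-running Next() call would be issued on this connection ...
}
```

If that is how the option is wired up, the observed behaviour makes sense: the PING frames only keep the agent-to-proxy connection alive, while NGINX terminates HTTP/2 itself and applies its own upstream read timeout to the proxied request, so a Next() call that waits longer than 60s for a task still ends in a 504.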
System Info
{"source":"https://github.com/woodpecker-ci/woodpecker","version":"2.7.3"}
Additional context
No response
Validations
- Read the docs.
- Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
- Checked that the bug isn't fixed in the `next` version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]
I don't see the relevance of that linked PR @zc-devs, as none of the variables removed by that PR refer to timeouts at all.
Sorry, I thought this change was the cause.
> Setting WOODPECKER_KEEPALIVE_TIME=10s on the agent to try and keep the connection open does nothing.

> WOODPECKER_KEEPALIVE_TIME argument on the agent ... does not keep the connection alive when used.
For the record, I tried the same setup using HAProxy, to rule out an NGINX-specific bug, and I got the same result:
Dec 07 18:57:37 test-vm haproxy[353712]: x.x.x.x:42374 [07/Dec/2024:18:56:47.682] y.y.y.y~ grpc_wp/wp 0/0/0/-1/50002 504 198 - - sH-- 1/1/0/0/0 0/0 "POST https://y.y.y.y:5001/proto.Woodpecker/Next HTTP/2.0"
Dec 07 18:57:37 test-vm woodpecker-agent[353750]: {"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 504 (Gateway Timeout); transport: received unexpected content-type \"text/html\"","time":"2024-12-07T18:57:37Z","message":"grpc error: next(): code: Unavailable"}
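The 50002 ms and the `sH--` termination state in the HAProxy log point at the server response-header timeout (commonly `timeout server 50s` in stock configs). As with NGINX, raising it is only a workaround sketch; the backend and server names below mirror the log (`grpc_wp`/`wp`), and the rest, including the upstream address, is assumed:

```
backend grpc_wp
    mode http
    # Workaround sketch: let the long-running Next() poll exceed the usual
    # 50s server timeout instead of being cut off with a 504 (sH--).
    # The 1h value is an arbitrary assumption.
    timeout server 1h
    server wp 127.0.0.1:3002 proto h2
```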