woodpecker-ci/woodpecker

Proxies (NGINX and HAProxy) return 504's on the `/proto.Woodpecker/Next` gRPC route between agent and server

Opened this issue · 4 comments

Component

server, agent

Describe the bug

I configured NGINX as a gRPC reverse proxy with TLS offloading between the agents and the server. However, the NGINX error log quickly fills with lines complaining about timeouts:

upstream timed out (110: Connection timed out) while reading response header from upstream, client: x.x.x.x, server: example.org, request: "POST /proto.Woodpecker/Next HTTP/2.0", upstream: "grpc://127.0.0.1:3002", host: "example.org"

Correspondingly, the NGINX access log shows 504's being returned to the agent:

[02/Dec/2024:13:27:57 +0000] "POST /proto.Woodpecker/Next HTTP/2.0" 504 167 "-" "grpc-go/1.65.0"

From the NGINX debug logs it can be determined that the timeouts occur after 60s.
Setting WOODPECKER_KEEPALIVE_TIME=10s on the agent to try to keep the connection open has no effect.

Steps to reproduce

  1. Install the latest Woodpecker server. Let it listen to gRPC on port 3002.
  2. Configure an NGINX reverse proxy to perform gRPC TLS offloading as follows:
upstream wp {
  server 127.0.0.1:3002;
}

server {
  listen 443 ssl http2;
  listen [::]:443 ssl http2;

  server_name example.org;

  ssl_certificate /etc/example.org/fullchain.pem;
  ssl_certificate_key /etc/example.org/privkey.pem;

  location / {
    grpc_pass grpc://wp;
  }
}
  3. Configure a Woodpecker agent with WOODPECKER_SERVER=example.org and WOODPECKER_GRPC_SECURE=true.
  4. Notice that the Woodpecker agent can connect to the server and take tasks from the queue successfully, but NGINX logs frequent (roughly one per minute) 504's in access.log and upstream timeouts in error.log.
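For reference: 60s is NGINX's default for grpc_read_timeout, which matches the observed interval, so a likely workaround (a sketch, not something I consider a proper fix) is to raise the gRPC timeouts for this location:

location / {
  grpc_pass grpc://wp;
  grpc_read_timeout 1h;  # default is 60s; long-poll calls on /proto.Woodpecker/Next can exceed this
  grpc_send_timeout 1h;
}

The 1h value is arbitrary; it just needs to comfortably exceed the server's long-poll interval.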

Expected behavior

The agent appears to long-poll the server for new tasks on the /proto.Woodpecker/Next route. This form of long-polling is fragile: an intermediary infrastructure component such as NGINX in the request path can terminate connections it considers idle or too long-lived.

I would have expected the WOODPECKER_KEEPALIVE_TIME setting on the agent to prevent this, but it does not keep the connection alive.

System Info

{"source":"https://github.com/woodpecker-ci/woodpecker","version":"2.7.3"}

Additional context

No response

Validations

  • Read the docs.
  • Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
  • Checked that the bug isn't fixed in the next version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]

I don't see the relevance of that linked PR @zc-devs, as none of the variables removed by that PR refer to timeouts at all.

Sorry, I thought this change was a cause.

Setting WOODPECKER_KEEPALIVE_TIME=10s on the agent to try and keep the connection open does nothing.

WOODPECKER_KEEPALIVE_TIME argument on the agent ... does not keep the connection alive when used.

For the record, I tried the same setup with HAProxy, to rule out an NGINX-specific bug, and got the same result:

Dec 07 18:57:37 test-vm haproxy[353712]: x.x.x.x:42374 [07/Dec/2024:18:56:47.682] y.y.y.y~ grpc_wp/wp 0/0/0/-1/50002 504 198 - - sH-- 1/1/0/0/0 0/0 "POST https://y.y.y.y:5001/proto.Woodpecker/Next HTTP/2.0"
Dec 07 18:57:37 test-vm woodpecker-agent[353750]: {"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 504 (Gateway Timeout); transport: received unexpected content-type \"text/html\"","time":"2024-12-07T18:57:37Z","message":"grpc error: next(): code: Unavailable"}
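The sH-- termination flags and the ~50s duration (50002 ms) in the HAProxy log line suggest HAProxy's server timeout fired while waiting for response headers, analogous to the NGINX case. A hedged sketch of a backend tweak (timeout server and timeout tunnel are real HAProxy directives; the backend and server names grpc_wp/wp are taken from the log above):

backend grpc_wp
  mode http
  timeout server 1h  # fired at ~50s while waiting for /proto.Woodpecker/Next to respond
  timeout tunnel 1h
  server wp 127.0.0.1:3002 proto h2

As with the NGINX workaround, this only papers over the underlying fragility of the long-poll.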