woodpecker-ci/woodpecker

Agent stops taking jobs after server throws 5XX errors


Component

agent

Describe the bug

When the server (running in Kubernetes) restarts, my Docker agent stops taking new jobs until it is restarted as well. While the server reboots, the agent logs show several 5XX errors. Afterwards the agent shows as online in the UI but does not pick up jobs.

Agent logs: see "Additional context" below.

Steps to reproduce

  1. Install the Woodpecker server in Kubernetes
  2. Install the agent on a separate server using Docker
  3. Kill the server so that it is recreated
  4. Trigger a pipeline that would use the Docker agent
  5. See it stay pending

Expected behavior

The agent should properly reconnect to the Server via gRPC after the server restarts.
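
As a rough illustration of the reconnect behaviour expected here, the sketch below retries a failing call with capped exponential backoff and jitter instead of giving up on the first error. It is not Woodpecker's actual agent code (the agent is written in Go); `RpcError`, `RETRYABLE`, and `call_with_backoff` are hypothetical names, and treating `Unknown` as retryable is an assumption based only on the status codes that appear in the agent logs of this report.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical set of status names treated as transient. "Unavailable" and
# "Unknown" are the two codes that appear in the agent log of this report.
RETRYABLE = {"Unavailable", "Unknown"}


class RpcError(Exception):
    """Stand-in for a gRPC error that carries a status code name."""

    def __init__(self, code: str, message: str = "") -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 8,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying transient RpcErrors with capped exponential backoff."""
    attempt = 0
    while True:
        try:
            return call()
        except RpcError as err:
            attempt += 1
            # Give up on non-transient codes or once the attempt budget is spent.
            if err.code not in RETRYABLE or attempt >= max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter so that several agents do not retry in lockstep.
            sleep(random.uniform(0, delay))
```

A runner loop that wrapped its job-polling call this way would ride out a short window of 5XX responses from a proxy rather than exiting on the first one.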

System Info

Server:
{"source":"https://github.com/woodpecker-ci/woodpecker","version":"2.7.3"}

Helm values:

---
server:
  ingress:
    # -- Enable the ingress for the server component
    enabled: true
    # -- Add annotations to the ingress
    annotations:
      # kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: "true"
    hosts:
      - host: woodpecker.example.com
        paths:
          - path: /
            backend:
              serviceName: woodpecker-svc
              servicePort: 80
    tls:
      - hosts:
          - woodpecker.example.com
        secretName: woodpecker-tls-key
  statefulSet:
    replicaCount: 1
  env:
    WOODPECKER_ADMIN: 'aaron'
    WOODPECKER_HOST: 'https://woodpecker.example.com'
    WOODPECKER_OPEN: true
    WOODPECKER_FORGEJO: true
    WOODPECKER_FORGEJO_URL: 'https://git.example.com'
    WOODPECKER_LOG_LEVEL: "error"
  extraSecretNamesForEnvFrom:
    - woodpecker-forgejo

gRPC Ingress:

---
apiVersion: v1
kind: Service
metadata:
  name: woodpecker-grpc
  namespace: woodpecker
  annotations:
    traefik.ingress.kubernetes.io/service.serversscheme: h2c
spec:
  selector:
    app.kubernetes.io/instance: woodpecker
    app.kubernetes.io/name: server
  ports:
    - name: grpc
      protocol: TCP
      port: 9000
      targetPort: grpc
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/tls-acme: "true"
    traefik.ingress.kubernetes.io/loadbalancer.server.scheme: h2c
    traefik.ingress.kubernetes.io/service.serversscheme: h2c
  name: woodpecker-grpc
  namespace: woodpecker
spec:
  rules:
    - host: "woodpecker-grpc.apps.example.com"
      http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: woodpecker-grpc
                port:
                  name: grpc
  tls:
    - hosts:
        - woodpecker-grpc.apps.example.com
      secretName: woodpecker-grpc-tls-key

docker-compose config for agent:

services:
  woodpecker-agent-1:
    container_name: woodpecker-agent-1
    image: woodpeckerci/woodpecker-agent:latest
    command: agent
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - WOODPECKER_SERVER=woodpecker-grpc.apps.example.com:443
      - WOODPECKER_AGENT_SECRET=${WOODPECKER_AGENT_SECRET}
      - WOODPECKER_MAX_WORKFLOWS=4
      - WOODPECKER_FILTER_LABELS="backend=docker"
      - WOODPECKER_BACKEND_DOCKER_ENABLE_IPV6=true
      - WOODPECKER_GRPC_SECURE=true
      - WOODPECKER_GRPC_VERIFY=true
    labels:
      - "com.centurylinklabs.watchtower.enable=true"

Additional context

Agent logs:

{"level":"info","time":"2024-11-23T08:44:52Z","message":"starting Woodpecker agent with version '2.7.3' and backend 'docker' using platform 'linux/amd64' running up to 4 pipelines in parallel"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:26:59Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:00Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:01Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:02Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:04Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:06Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:12Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:19Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"grpc error: next(): code: Unknown"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"runner done with error"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"grpc error: next(): code: Unknown"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"runner done with error"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"grpc error: next(): code: Unknown"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"runner done with error"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"grpc error: next(): code: Unknown"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"runner done with error"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:24Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:34Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:39Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:53Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:00Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:15Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:29Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:40Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:54Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:29:02Z","message":"grpc error: report_health(): code: Unavailable"}
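
To make the pattern in these logs easier to see, a small helper can count entries by level and message. This is only a sketch for reading the output; `summarize` is a hypothetical name and nothing here is part of Woodpecker.

```python
import json
from collections import Counter
from typing import Iterable


def summarize(lines: Iterable[str]) -> Counter:
    """Count JSON log entries by (level, message), skipping anything else."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Ignore lines that are not JSON (for example shell noise).
            continue
        if not isinstance(entry, dict):
            continue
        key = (entry.get("level", "?"), entry.get("message", "?"))
        counts[key] += 1
    return counts
```

Applied to the log above, it gives 18 `report_health()` warnings, 4 `next()` errors and 4 `runner done with error` entries. The four runner exits equal `WOODPECKER_MAX_WORKFLOWS=4`, which may mean that every runner stopped while only the health reporting kept retrying; that reading is an assumption, not something confirmed in this report.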

Validations

  • Read the docs.
  • Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
  • Checked that the bug isn't fixed in the next version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]

Does it work if you deploy an agent in Kubernetes (direct Agent-Server connection, not via Traefik)?

JFYI, this is my IngressRoute, which worked a couple of months ago:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: woodpecker-server
spec:
  entryPoints:
  - websecure
  routes:
  - kind: Rule
    match: Host(`wp.domain.tld`)
    services:
    - name: woodpecker-server
      port: http
  - kind: Rule
    match: Host(`wp.domain.tld`) && Headers(`Content-Type`, `application/grpc`)
    services:
    - name: woodpecker-server
      port: grpc
      scheme: h2c

However, I didn't restart the server back then, if I remember correctly.

The Kubernetes agents work fine and are not affected by the problem. The 5XX errors very likely come mainly from Traefik. However, I would also expect the agent not to fall over when there are errors for a few seconds.

Matching on the content type is a good hint; I might implement this. I currently don't use IngressRoute objects and instead configure normal Ingresses with annotations.

received unexpected content-type "text/plain; charset=utf-8"
errors come from Traefik

I think so too, and I have run into this as well.

The agent should properly reconnect

{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:24Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:34Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:39Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:53Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:00Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:15Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:29Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:40Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:54Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:29:02Z","message":"grpc error: report_health(): code: Unavailable"}

It seems to be trying.


Do you have two ingresses: one for HTTP, another for gRPC? Could you show the HTTP one?

pat-s commented

Accidentally added the label. Can't remove it anymore :/