mtrudel/bandit

Issue with K8s liveness probes

Sgiath opened this issue · 13 comments

Hi, I am a fan of the project, so I have been running a relatively simple Phoenix app in production with Bandit since September 2023 (and I have regularly updated the Bandit version in that time). The app uses an Elixir release inside Docker, running in a K8s cluster.

In December 2023, we noticed that the pod was being restarted quite regularly without any apparent issue in the logs. After further investigation, we discovered it had been behaving like that from the beginning; we had simply not noticed until December. We determined that the liveness or readiness probes in K8s were sometimes timing out, so K8s would restart the pod to make it available again. We had multiple theories about why that happens, but since it didn't affect our older services, we tried switching to Cowboy, which was a very quick change. That immediately fixed the issue, and we haven't had any restarts since.

I am not sure what the underlying issue is, since there were no logs in the app; we found out only by inspecting the K8s logs, and those only tell us that the request timed out. I am 100% positive that switching to Cowboy was the only change that fixed it. I did not originally plan to report this (since I am not even sure what was wrong and I don't have an easy way to reproduce it), but since Bandit is now the new default for Phoenix, I am reporting it so you are aware of an issue we hit with Bandit in production.

The app uses LiveView. Here is an excerpt from the K8s config we used:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: auth
  name: auth
  namespace: prod
spec:
  replicas: 4
  selector:
    matchLabels:
      app: auth
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: auth
    spec:
      serviceAccountName: auth
      containers:
        - name: auth
          image: auth:production-master-3252e5c
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 4040
              name: http
              protocol: TCP

          startupProbe:
            httpGet:
              path: /__startup
              port: http
            periodSeconds: 3
            failureThreshold: 10

          livenessProbe:
            httpGet:
              path: /__liveness
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 2

          readinessProbe:
            httpGet:
              path: /__readiness
              port: http
            initialDelaySeconds: 5
            periodSeconds: 1
            failureThreshold: 2

          resources:
            requests:
              memory: "225Mi"
              cpu: "30m"
            limits:
              memory: "768Mi"
              cpu: "500m"

and this is the implementation of the health endpoints (it is included as the first plug in the Endpoint module):

defmodule Web.Plug.Health do
  @behaviour Plug

  import Plug.Conn, only: [halt: 1, send_resp: 3]

  alias Auth.Core.Repo
  alias Ecto.Adapters.SQL
  alias FT.K8S.TrafficDrainHandler

  require Logger

  @impl Plug
  def init(opts), do: opts

  @impl Plug
  def call(%Plug.Conn{} = conn, _opts) do
    case conn.request_path do
      "/__startup" -> health_response(conn, started?())
      "/__liveness" -> health_response(conn, alive?())
      "/__health" -> health_response(conn, alive?())
      "/__readiness" -> health_response(conn, ready?())
      _other -> conn
    end
  end

  defp health_response(conn, true) do
    conn
    |> send_resp(200, "OK")
    |> halt()
  end

  defp health_response(conn, false) do
    conn
    |> send_resp(503, "SERVICE UNAVAILABLE")
    |> halt()
  end

  defp started?, do: alive?()

  defp alive? do
    Repo
    |> SQL.query!("SELECT 1")
    |> is_struct(MyXQL.Result)
  rescue
    _e -> false
  end

  def ready?, do: not TrafficDrainHandler.draining?()
end
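
For context, a minimal sketch of how a plug like this is typically mounted as the first plug in a Phoenix endpoint (the endpoint module and OTP app names below are assumptions, not taken from the actual app):

defmodule Web.Endpoint do
  use Phoenix.Endpoint, otp_app: :auth

  # Health checks are answered here, before any telemetry, logging, or routing,
  # which is also why probe requests never show up in the request logs.
  plug Web.Plug.Health

  plug Plug.Telemetry, event_prefix: [:phoenix, :endpoint]
  plug Web.Router
end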

Usually when this comes up, it's an issue of your probes being sent over HTTP while your server is serving HTTPS (or vice versa).
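
As an illustration (assumed config, not taken from the reporter's app): the scheme the server actually speaks is set on the endpoint, and an httpGet probe sends plain HTTP unless scheme: HTTPS is configured on the probe, so the two have to agree:

# config/runtime.exs (hypothetical): a plain-HTTP listener on port 4040, matching
# the httpGet probes in the deployment above. If this were an `https:` listener
# instead, plain-HTTP probes would fail.
config :auth, Web.Endpoint,
  http: [ip: {0, 0, 0, 0}, port: 4040]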

The other thing I might suggest is that the fall-through case of your case conn.request_path is firing, which would not emit a response (though this would also error loudly in your logs).
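
One way to check for that (purely illustrative, a variant of call/2 from the plug above) is to log whenever a probe-style path reaches the fall-through clause:

# Illustrative variant of Web.Plug.Health.call/2: same routing as above, but any
# unrecognized "/__" path gets logged before being passed down the pipeline.
@impl Plug
def call(%Plug.Conn{} = conn, _opts) do
  case conn.request_path do
    "/__startup" -> health_response(conn, started?())
    "/__liveness" -> health_response(conn, alive?())
    "/__health" -> health_response(conn, alive?())
    "/__readiness" -> health_response(conn, ready?())
    other ->
      if String.starts_with?(other, "/__") do
        Logger.warning("health plug fall-through for #{other}")
      end

      conn
  end
end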

I find it curious that you're not seeing any logs at all; Bandit is generally much noisier than Cowboy, so I'd expect there to be something in your logs for either of the possibilities above.

As a last resort, is it possible for you to remote console into your servers and run Bandit.Logger.attach(:info) to see what's happening?

@Sgiath we're noticing something similar with two different high-traffic websites that serve both HTTP and WebSocket traffic. Unfortunately, it's not just the probes that fail; all of the traffic seems to hang until the pod is restarted. This only started happening once we switched to Bandit over the weekend, and as we've monitored the symptom over the past couple of days we've become increasingly confident it's related to our migration from Cowboy to Bandit. If we can't narrow this down today and we hit another episode, we may revert to Cowboy for the remainder of the week to see if that resolves the issue.

@mtrudel - I noticed Bandit.Logger.attach(:info) didn't work, so I'm assuming you meant Bandit.Logger.attach_logger(:info)?

This project looks very exciting, so I'm hoping we can figure out what's going on and continue with it in production.

@mtrudel - I noticed Bandit.Logger.attach(:info) didn't work, so I'm assuming you meant Bandit.Logger.attach_logger(:info)?

Yep. Typo.

By any chance:

  • do you have metrics on what happens to your nodes' run queue length when things hang? (see the sketch below)
  • what transport (HTTP/HTTPS) and protocol (HTTP/1 / HTTP/2 / WebSocket) is in use?
  • have you managed to notice whether the hangs coincide with WebSocket disconnections (or perhaps with an accumulation of disconnections over time)?
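
For the run queue question, a quick way to sample it from a remote console (standard :erlang calls, nothing Bandit-specific):

# Total length of the schedulers' run queues; a persistently large value while
# things hang suggests scheduler starvation.
:erlang.statistics(:total_run_queue_lengths)

# Per-run-queue breakdown, to see whether a single scheduler is pinned.
:erlang.statistics(:run_queue_lengths)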

I'm currently working on a bug with the Nerves Hub folks that looks like some sort of resource starvation at WebSocket disconnect, which induces symptoms similar to hanging. I wonder if this isn't the same thing.

For clarity, @bfolkens, would you be able to open a separate issue? It doesn't sound like yours is directly related to @Sgiath 's problem

We are serving plain HTTP from Docker, with SSL termination at the edge of the K8s cluster. We also have WebSockets (since we use LiveView). I am not aware of other traffic timing out, but we have four replicas, so other pods may have taken over. As I said, though, we switched it out around Christmas, so I no longer have any logs from that period. We ship all logs at info level and above to Datadog, and nothing there was related to this issue. However, the health endpoints are intentionally above the request logger to avoid polluting the logs with useless information.
I could try switching our staging environment to Bandit to see if I can reproduce it when I find the time; I don't have any other way to reproduce it.

Thanks for the quick reply @mtrudel !

For clarity, @bfolkens, would you be able to open a separate issue? It doesn't sound like yours is directly related to @Sgiath 's problem

Certainly, reposted with more detail in #311

@Sgiath What load balancer is in front of this? AWS 'network' mode?

I think we may have solved this in Thousand Island 1.3.3 (just released). Can you bump your Thousand Island dep and see if that solves things?
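
For reference, a minimal sketch of what that bump could look like in mix.exs (the bandit version requirement is an assumption):

defp deps do
  [
    {:bandit, "~> 1.0"},
    # Pin the transport explicitly so the 1.3.3 release is picked up
    {:thousand_island, "~> 1.3.3"}
  ]
end

Alternatively, if the existing constraint already allows 1.3.x, running mix deps.update thousand_island is enough.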

There is an AWS ALB as the entry point into the K8s cluster:
https://aws.amazon.com/elasticloadbalancing/application-load-balancer/

I will try to bump TI and let you know

Nope, that didn't help. The first pod restarted 20 minutes after the deployment and a second one 3 minutes after that. Same as before, no related logs in Datadog.

@Sgiath if you are able to, can you also try this?

#311 (comment)

To be honest, I think the chance of this having any effect is less than 1%, but I can't think of anything better so far... unless @mtrudel has a better idea.

Does the rest of your app work as expected? Can you confirm that the only things not working as expected are the endpoints served by the plug above?

If that is indeed the case, I'd suggest moving them below your logging so you can see which one is failing. My hunch is that something may be different with the draining setup, since the other endpoints seem pretty unremarkable (and draining is done differently between Cowboy and Bandit). Regardless, it would be good to have some log info on them.
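
A small sketch of that reordering (again with assumed module names), flipping the endpoint order shown earlier in the thread so probe requests leave a trace in the logs:

# In the endpoint: telemetry/logging first, health checks second.
plug Plug.Telemetry, event_prefix: [:phoenix, :endpoint]
plug Web.Plug.Health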

Sorry, it was probably a false alarm. I left it running overnight, and no more restarts occurred, just those two in the first 20 minutes. I will keep monitoring, but so far everything seems good.

Yes, the site is working as usual without any observable problems.

Good to hear @Sgiath!

Closing for hygiene; if the issue (or any other) resurfaces, feel free to open another issue!