open-policy-agent/opa

Status plugin delays start-up when falling back to persisted bundles

charlieegan3 opened this issue · 6 comments

When persisting discovery bundles, and having configured a status service, when opa restarts during downtime in the status service, startup is delayed while attempting to run status plugin.

To reproduce the issue:

Create a simple go server:

package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"os"

	"github.com/open-policy-agent/opa/bundle"
)

func main() {

	port := 8080

	if len(os.Args) > 1 {
		http.HandleFunc("/discovery", discoveryHandler)
		http.HandleFunc("/bundle", bundleHandler)

		fmt.Printf("Starting server on localhost:%d...\n", port)
		err := http.ListenAndServe(fmt.Sprintf("localhost:%d", port), nil)
		if err != nil {
			fmt.Printf("Error starting server: %s\n", err)
		}

	}

	// start a timeouting server
	listener, err := net.Listen("tcp", fmt.Sprintf("localhost:%d", port))
	if err != nil {
		fmt.Println("Error starting server:", err)
		return
	}
	defer listener.Close()

	// This server will run indefinitely without accepting incoming connections.
	select {}

}

func discoveryHandler(w http.ResponseWriter, r *http.Request) {

	b := bundle.Bundle{
		Data: map[string]interface{}{
			"discovery": map[string]interface{}{
				"services": []interface{}{
					map[string]interface{}{
						"name": "localhost",
						"url":  "http://localhost:8080",
					},
				},
				"status": map[string]interface{}{
					"service": "localhost",
				},
				"bundles": map[string]interface{}{
					"main": map[string]interface{}{
						"service":  "localhost",
						"resource": "/bundle",
						"persist":  true,
					},
				},
			},
		},
	}

	w.Header().Set("content-type", "application/vnd.openpolicyagent.bundles")
	err := bundle.NewWriter(w).Write(b)
	if err != nil {
		log.Printf("error creating bundle: %v\n", err)
		w.WriteHeader(http.StatusInternalServerError)
	}
}

func bundleHandler(w http.ResponseWriter, r *http.Request) {

	b := bundle.Bundle{
		Data: map[string]interface{}{"foo": "bar"},
	}

	w.Header().Set("content-type", "application/vnd.openpolicyagent.bundles")
	err := bundle.NewWriter(w).Write(b)
	if err != nil {
		log.Printf("error creating bundle: %v\n", err)
		w.WriteHeader(http.StatusInternalServerError)
	}
}

Run the server:

go run main.go server

Create some opa config:

services:
  - name: localhost
    url: http://localhost:8080
discovery:
  name: localhost
  resource: /discovery
  decision: discovery
  persist: true
  polling:
    min_delay_seconds: 10
    max_delay_seconds: 10

Run OPA, clearing the local state:

rm -rf .opa && go run main.go run -s --addr=127.0.0.1:8181 --config-file=config.yaml

OPA starts ok, and the bundles are persisted to disk.

Now stop the go server and OPA, and run the server in 'timeout mode' where a dial timeout is simulated:

go run main.go

Now run OPA again, using the local state:

go run main.go run -s --addr=127.0.0.1 --config-file=config.yaml

We see OPA waits trying to report the status before starting:

$ go run main.go run -s --addr=127.0.0.1:8181 --config-file=config.yaml
{"addrs":["127.0.0.1:8181"],"diagnostic-addrs":[],"level":"info","msg":"Initializing server.","time":"2023-10-25T12:08:59+01:00"}
{"level":"info","msg":"Starting bundle loader.","name":"main","plugin":"bundle","time":"2023-10-25T12:08:59+01:00"}
{"level":"info","msg":"Starting status reporter.","plugin":"status","time":"2023-10-25T12:08:59+01:00"}
{"level":"error","msg":"Bundle load failed: request failed: Get \"http://localhost:8080/bundle\": net/http: timeout awaiting response headers","name":"main","plugin":"bundle","time":"2023-10-25T12:09:09+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:09:09+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:09:19+01:00"}
{"level":"error","msg":"Bundle load failed: request failed: Get \"http://localhost:8080/bundle\": net/http: timeout awaiting response headers","name":"main","plugin":"bundle","time":"2023-10-25T12:09:19+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:09:29+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:09:39+01:00"}
{"level":"error","msg":"Bundle load failed: request failed: Get \"http://localhost:8080/bundle\": net/http: timeout awaiting response headers","name":"main","plugin":"bundle","time":"2023-10-25T12:09:39+01:00"}
{"level":"error","msg":"Discovery download failed: request failed: Get \"http://localhost:8080/discovery\": net/http: timeout awaiting response headers","plugin":"discovery","time":"2023-10-25T12:09:49+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:09:49+01:00"}
{"level":"error","msg":"Bundle load failed: request failed: Get \"http://localhost:8080/bundle\": net/http: timeout awaiting response headers","name":"main","plugin":"bundle","time":"2023-10-25T12:09:49+01:00"}
{"level":"error","msg":"Discovery download failed: request failed: Get \"http://localhost:8080/discovery\": net/http: timeout awaiting response headers","plugin":"discovery","time":"2023-10-25T12:09:59+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:09:59+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:10:09+01:00"}
{"level":"error","msg":"Bundle load failed: request failed: Get \"http://localhost:8080/bundle\": net/http: timeout awaiting response headers","name":"main","plugin":"bundle","time":"2023-10-25T12:10:09+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:10:19+01:00"}
{"level":"error","msg":"Discovery download failed: request failed: Get \"http://localhost:8080/discovery\": net/http: timeout awaiting response headers","plugin":"discovery","time":"2023-10-25T12:10:19+01:00"}
{"client_addr":"127.0.0.1:62971","level":"info","msg":"Received request.","req_id":1,"req_method":"GET","req_path":"/health","time":"2023-10-25T12:10:26+01:00"}
{"client_addr":"127.0.0.1:62971","level":"info","msg":"Sent response.","req_id":1,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.788292,"resp_status":200,"time":"2023-10-25T12:10:26+01:00"}

Note the timestamps, and that all this time the agent isn't serving. The instance does start, but it is delayed.

This appears to be related to the loop activating plugins here: https://github.com/open-policy-agent/opa/blob/main/plugins/discovery/discovery.go#L240. It'd be good to avoid having the status plugin delaying boot when using persisted bundles.

Also notice that the for retry := 0; retry < maxActivationRetry; retry++ loop in https://github.com/open-policy-agent/opa/blob/main/plugins/discovery/discovery.go#L259 seems to lack a break statement at the end, so even when successfully activating the bundle in first attempt it will still run the loop 10 times (further slowing down the startup time).
I have confirmed this with logs (seeing 10 instances of the "Discovery bundle loaded from disk and activated successfully." debug log message before the server starts).

Also notice that the for retry := 0; retry < maxActivationRetry; retry++ loop in https://github.com/open-policy-agent/opa/blob/main/plugins/discovery/discovery.go#L259 seems to lack a break statement at the end, so even when successfully activating the bundle in first attempt it will still run the loop 10 times (further slowing down the startup time). I have confirmed this with logs (seeing 10 instances of the "Discovery bundle loaded from disk and activated successfully." debug log message before the server starts).

Good catch. This is a bug.

Note the timestamps, and that all this time the agent isn't serving. The instance does start, but it is delayed.

This appears to be related to the loop activating plugins here: https://github.com/open-policy-agent/opa/blob/main/plugins/discovery/discovery.go#L240. It'd be good to avoid having the status plugin delaying boot when using persisted bundles.

@charlieegan3 I haven't tried this but have we looked into using buffered channels?

Alternatively since the manager has the plugin status on it, we could do the status upload async.

Hey, I've been having a play with using a buffered channel for the updates this morning. It works, but our tests are quite brittle around some of this stuff and so they need to be reworked. I'll see where I can get to this afternoon.

#6345 is all I had time for today. I've tried to make some careful updates to tests to maintain the behaviour, but remove the expectation that the status plugin will be in State:OK immediately. Take a look, I'm out tomorrow so feel free to take over and make edits as needed.