open-policy-agent/opa

Panic in opa-envoy-plugin error metrics

Closed this issue · 6 comments

I think I'm seeing a panic in an openpolicyagent/opa:envoy-0.58.0-3-rootless container (I think that's opa-envoy-plugin @ 23923671159a15044e4f0e9989e1dca56ca59d80) that points at this line, recently introduced in open-policy-agent/opa-envoy-plugin#477.

Here's the traceback:

[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x884092]

goroutine 172 [running]:
github.com/prometheus/client_golang/prometheus.(*MetricVec).GetMetricWith(0x0, 0xc000396201?)
	/src/vendor/github.com/prometheus/client_golang/prometheus/vec.go:253 +0x32
github.com/prometheus/client_golang/prometheus.(*CounterVec).GetMetricWith(0x1c4f9e0?, 0xc00063e870?)
	/src/vendor/github.com/prometheus/client_golang/prometheus/counter.go:259 +0x16
github.com/prometheus/client_golang/prometheus.(*CounterVec).With(...)
	/src/vendor/github.com/prometheus/client_golang/prometheus/counter.go:284
github.com/open-policy-agent/opa-envoy-plugin/internal.(*envoyExtAuthzGrpcServer).check.func3()
	/src/internal/internal.go:397 +0x2e7
github.com/open-policy-agent/opa-envoy-plugin/internal.(*envoyExtAuthzGrpcServer).Check(0x1d68480?, {0x2315670?, 0xc0006140c0?}, 0x0?)
	/src/internal/internal.go:355 +0x42
github.com/envoyproxy/go-control-plane/envoy/service/auth/v3._Authorization_Check_Handler({0x1dde5e0?, 0xc000392460}, {0x2315670, 0xc0006140c0}, 0xc00022c100, 0x0)
	/src/vendor/github.com/envoyproxy/go-control-plane/envoy/service/auth/v3/external_auth.pb.go:692 +0x169
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000146000, {0x2315670, 0xc000191f80}, {0x231bde0, 0xc000582820}, 0xc0005f8a20, 0xc00032cae0, 0x2fdf3f0, 0x0)
	/src/vendor/google.golang.org/grpc/server.go:1343 +0xe03
google.golang.org/grpc.(*Server).handleStream(0xc000146000, {0x231bde0, 0xc000582820}, 0xc0005f8a20)
	/src/vendor/google.golang.org/grpc/server.go:1737 +0xc4c
google.golang.org/grpc.(*Server).serveStreams.func1.1()
	/src/vendor/google.golang.org/grpc/server.go:986 +0x86
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 151
	/src/vendor/google.golang.org/grpc/server.go:997 +0x145

I don't know what original error is trying to be unwrapped and logged here, because I never get a chance to see it.

Thanks for reporting @tilgovi! @rudrakhp can you please take a look?

@ashutosh-narkar Ah I see the issue here. Both performance and error metric objects are initialised here only if performance metrics are enabled in config. But we are emitting error metrics regardless of this configuration. @tilgovi can probably confirm this by checking if they have not enabled performance metrics.

I would suggest moving error metric initialisation out of this condition or emit error metric only if performance metrics are enabled (but I think error metric should be emitted regardless since it's not performance related). Let me know which one sounds good, you can assign this to me and I will push a quick fix.

PS: Also will be useful add a debug log for the internalErr object before emitting error metric.

That's correct. I had not enabled performance metrics.

Thanks for looking into this @rudrakhp.

emit error metric only if performance metrics are enabled

This sounds like a good approach. If someone wants these metrics they can opt-in.

@ashutosh-narkar above fix should suffice. Please review.

Thanks for the quick resolution!