open-policy-agent/opa

Failed to shutdown OpenTelemetry trace exporter gracefully.


Short description

When running OPA with the distributed tracing option enabled, we get the following error when the server is stopped.

Failed to shutdown OpenTelemetry trace exporter gracefully.

Steps To Reproduce

  1. Get the latest version of OPA (I've tried with version 0.62 and the latest development build):
./opa_linux_amd64 version
Version: 0.63.0-dev
Build Commit: d3a4a87f506dec744ed93177c37bea5e2afa186b
Build Timestamp: 2024-03-27T19:08:00Z
Build Hostname: nico-Latitude-7420
Go Version: go1.21.0
Platform: linux/amd64
WebAssembly: available
  2. Run the following command line:
./opa_linux_amd64 run --server --set distributed_tracing.type=grpc
  3. Kill the process (Ctrl + C works fine).
  4. The full output should look like this:
./opa_linux_amd64 run --server --set distributed_tracing.type=grpc
{"addrs":[":8181"],"diagnostic-addrs":[],"level":"info","msg":"Initializing server. OPA is running on a public (0.0.0.0) network interface. Unless you intend to expose OPA outside of the host, binding to the localhost interface (--addr localhost:8181) is recommended. See https://www.openpolicyagent.org/docs/latest/security/#interface-binding","time":"2024-03-28T09:18:37+01:00"}
^C{"level":"info","msg":"Shutting down...","time":"2024-03-28T09:18:38+01:00"}
{"level":"info","msg":"Server shutdown.","time":"2024-03-28T09:18:38+01:00"}
{"err":"context canceled","level":"error","msg":"Failed to shutdown OpenTelemetry trace exporter gracefully.","time":"2024-03-28T09:18:38+01:00"}

Expected behavior

The OpenTelemetry trace exporter should be gracefully shut down, and no error should be seen.

Additional context

We first saw this error after upgrading OPA from version 0.59 to 0.62.

Thanks for bringing this up. I think it's not wired up correctly: the deferred exporter shutdown runs with a context that has already been canceled by the time it executes. This

opa/runtime/runtime.go

Lines 538 to 543 in d3a4a87

defer func() {
    err := rt.traceExporter.Shutdown(ctx)
    if err != nil {
        rt.logger.WithFields(map[string]interface{}{"err": err}).Error("Failed to shutdown OpenTelemetry trace exporter gracefully.")
    }
}()

should happen in the shutdown code path, maybe here

opa/runtime/runtime.go

Lines 845 to 861 in d3a4a87

func (rt *Runtime) gracefulServerShutdown(s *server.Server) error {
    if rt.Params.ShutdownWaitPeriod > 0 {
        rt.logger.Info("Waiting %vs before initiating shutdown...", rt.Params.ShutdownWaitPeriod)
        time.Sleep(time.Duration(rt.Params.ShutdownWaitPeriod) * time.Second)
    }
    rt.logger.Info("Shutting down...")
    ctx, cancel := context.WithTimeout(context.Background(), time.Duration(rt.Params.GracefulShutdownPeriod)*time.Second)
    defer cancel()
    err := s.Shutdown(ctx)
    if err != nil {
        rt.logger.WithFields(map[string]interface{}{"err": err}).Error("Failed to shutdown server gracefully.")
        return err
    }
    rt.logger.Info("Server shutdown.")
    return nil
}

or here

opa/server/server.go

Lines 225 to 249 in d3a4a87

func (s *Server) Shutdown(ctx context.Context) error {
    errChan := make(chan error)
    for _, srvr := range s.httpListeners {
        go func(s httpListener) {
            errChan <- s.Shutdown(ctx)
        }(srvr)
    }
    // wait until each server has finished shutting down
    var errorList []error
    for i := 0; i < len(s.httpListeners); i++ {
        err := <-errChan
        if err != nil {
            errorList = append(errorList, err)
        }
    }
    if len(errorList) > 0 {
        errMsg := "error while shutting down: "
        for i, err := range errorList {
            errMsg += fmt.Sprintf("(%d) %s. ", i, err.Error())
        }
        return errors.New(errMsg)
    }
    return nil
}

@nicolaschotard if you'd like to submit a fix based on Stephan's feedback feel free to do so. We're happy to help out. Thanks for reporting this.

Hey, thank you both for your quick answers. And indeed, moving the rt.traceExporter.Shutdown(ctx) call into the gracefulServerShutdown function seems to fix this issue. I will submit a PR with a fix as soon as I find the time to work on it.
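
For reference, here is a rough sketch of what I have in mind, based on the excerpts above (field and method names are taken from the current code; the actual PR may end up looking different):

func (rt *Runtime) gracefulServerShutdown(s *server.Server) error {
    if rt.Params.ShutdownWaitPeriod > 0 {
        rt.logger.Info("Waiting %vs before initiating shutdown...", rt.Params.ShutdownWaitPeriod)
        time.Sleep(time.Duration(rt.Params.ShutdownWaitPeriod) * time.Second)
    }
    rt.logger.Info("Shutting down...")
    ctx, cancel := context.WithTimeout(context.Background(), time.Duration(rt.Params.GracefulShutdownPeriod)*time.Second)
    defer cancel()
    err := s.Shutdown(ctx)
    if err != nil {
        rt.logger.WithFields(map[string]interface{}{"err": err}).Error("Failed to shutdown server gracefully.")
        return err
    }
    rt.logger.Info("Server shutdown.")
    // Shut down the trace exporter here, while the shutdown context is still
    // valid, instead of deferring it off the serve loop's context, which has
    // already been canceled by the time that deferred function runs.
    if rt.traceExporter != nil { // assuming traceExporter is nil when tracing is disabled
        if err := rt.traceExporter.Shutdown(ctx); err != nil {
            rt.logger.WithFields(map[string]interface{}{"err": err}).Error("Failed to shutdown OpenTelemetry trace exporter gracefully.")
        }
    }
    return nil
}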

And I believe this commit introduced the bug.

stale commented

This issue has been automatically marked as inactive because it has not had any activity in the last 30 days. Although currently inactive, the issue could still be considered and actively worked on in the future. More details about the use-case this issue attempts to address, the value provided by completing it or possible solutions to resolve it would help to prioritize the issue.