open-policy-agent/opa

Have metrics dedicated to context cancelled/deadline exceeded

rudrakhp opened this issue · 3 comments

What is the underlying problem you're trying to solve?

We have logs that can help identify context deadline exceeded and context cancelled events, but from a monitoring and alerting perspective there is no metric for them today (referring to this list).

Describe the ideal solution

I think it would be good to have metrics dedicated to context cancellations, ideally with a reason tag (check request timed out, http send timed out, context cancelled during X eval, etc.).

Describe a "Good Enough" solution

We could skip the reason tag in the short term, but a basic metric would definitely be helpful.

Additional Context

N/A

The opa-envoy plugin has the option to expose performance metrics via Prometheus. We could add a counter there for this. These metrics are then surfaced via the Status API.

@ashutosh-narkar Thanks for the quick response!

I have been trying to capture and classify various errors we are getting in our logs. Here is a log I had a question about:

{
  "level": "error",
  "msg": "Log event masking failed: eval_cancel_error: caller cancelled query execution.",
  "plugin": "decision_logs",
  "time": "2023-10-28T13:51:59Z"
}

I see a TODO here. Is this why the error is not propagated to the Envoy plugin, where ideally the complete decision log (including input) should be logged? Any pointers that would help me understand this issue better would be appreciated. Thanks!

The code you're referring to is old: OPA uses the main branch, not master. That code has since been removed, and we now maintain metrics for errors in the decision log plugin. For the specific error in your log, there is currently no counter to track it. We can add one or, as I mentioned previously, you can add a counter in the plugin itself.
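Until a dedicated counter exists, one stopgap is to derive a count from the logs themselves. The sketch below assumes logs are newline-delimited JSON shaped like the entry quoted earlier in this thread; `logEntry` and `countCancelErrors` are hypothetical names, and matching on the `eval_cancel_error` substring is an assumption about how stable that message prefix is, not something OPA guarantees.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry mirrors the fields of the log line quoted in this thread.
type logEntry struct {
	Level  string `json:"level"`
	Msg    string `json:"msg"`
	Plugin string `json:"plugin"`
	Time   string `json:"time"`
}

// countCancelErrors scans newline-delimited JSON logs and counts error
// entries whose message mentions eval_cancel_error, grouped by plugin.
func countCancelErrors(logs string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue // skip lines that are not valid JSON
		}
		if e.Level == "error" && strings.Contains(e.Msg, "eval_cancel_error") {
			counts[e.Plugin]++
		}
	}
	return counts
}

func main() {
	// The first line is the entry from this thread; the second is a
	// made-up info line included only to show that it is ignored.
	logs := `{"level":"error","msg":"Log event masking failed: eval_cancel_error: caller cancelled query execution.","plugin":"decision_logs","time":"2023-10-28T13:51:59Z"}
{"level":"info","msg":"example info line","time":"2023-10-28T13:52:00Z"}`
	fmt.Println(countCancelErrors(logs))
}
```

String matching on log messages is brittle compared with a real counter, so this is best treated as a temporary measure for classification and alerting.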