open-policy-agent/opa

Add support for rule-level tracing

Opened this issue · 8 comments

It would be nice to have a cheap, efficient rule-level tracer that could be enabled in more situations than the current tracing implementation allows. The current tracing implementation operates at the expression level and requires that the evaluator to plug all of in-scope variables to produce values for the trace consumer. While this gives users complete visibility into the execution of the policy, in many cases it's more information than needed. The plug operation in particular is very expensive and means that the tracer cannot be used in latency-sensitive use cases.

A rule level tracer that simply reported the rules that exited and failed to exit would provide a decent starting place. For each rule that exited or failed to exit, the report would include the AST location. An even higher-level report would simply be the names of the virtual documents that were defined or undefined during evaluation.

Importantly, this kind of tracer should have minimal impact on policy execution performance. It should be possible for users to configure OPA to always generate these reports. This could help greatly with providing explanations about why the policy allowed or denied access.

This is a killer feature that we definitely need. Any updates you can provide on where this falls on the roadmap? And as far as interim alternatives go, the best thing I can think of to do is to include the result of all the child rules in the output of the queried rule if there is a failure, thereby having the "reason" show up in the decision logs under "result". something like this?

do_thing = r {
  employee
  2fa
  special_role
  r = { "allowed": true, "reason": "do_thing 1 rules all passed" }
} else = r {
  admin
   r = { "allowed": true, "reason": "do_thing 2 rules all passed" }
} else = r {
  r = { "allowed": false, "reason": "see other properties", "employee": employee, "2fa": 2fa, "special_role": special_role, "admin": admin }
}

Does that seem like a sane approach? It certainly makes policy more verbose...

Any updates you can provide on where this falls on the roadmap?

I think it is something we would like to do, but it isn't actively under development. No specific timeline for when it would be added.

I think this feature was primarily aimed at a use-case where some authz decision wanted to include a reason for a like better UX than a generic "not authorized". For the decision log as sort of an "after the fact" thing you have more options. Assuming that you mostly were interested in the decision log, consider that you can do all the slow tracing you want later on if you are using bundles w/ revisions configured in the manifest (https://www.openpolicyagent.org/docs/latest/management/#bundle-file-format ). The decision log entry includes the full input, query path, and revision (ie bundle source pointer) so you can re-run the exact same evaluation later on with tracing enabled to see exactly what happened.

For including a reason back to the user I think the most common approach is to use the "partial" rules (you'll see tons of policies around with deny[msg] style rules, but its not limited to them). Consider some examples, based on the snippet you shared, which may give you some ideas:

default authz = false

authz {
    count(deny) == 0
    count(allow) > 0
}

allow[msg] {
    employee
    2fa
    special_role
    msg := "do_thing 1 rules all passed"
}

allow[msg] {
    admin
    msg := "do_thing 2 rules all passed"
}

deny[msg] {
    not admin
    not 2fa
    msg := "non admins require 2fa"
}

With something like this you can get your boolean yes/no easily by looking at authz, and if you wanted to you can get the full set of reasons to allow and/or deny it. They're also much easier (IMO) for people to understand the rule logic.

I guess too as a disclaimer this is by no way a prescriptive thing, just an example of another (maybe easier) way to think about writing that type of policy.

Being able to re-run previous queries with slow-tracing on does seem useful, but I don't think it's a full solution for us, for a couple of reasons:

  • We need to mask parts of the input so they don't appear in decision logs (sensitive tokens we don't want to end up in centralized logging).
  • Even if we could keep the tokens in the input, there are time-sensitive rules that will have different results if we were to run it after the fact (e.g. token freshness/expiration).

I think the ideal result for us is that decision logs contain the data necessary for us to visualize the reason a policy decision was rendered. I'm not exactly sure what that could/should look like though.

The snippet you posted does seem like a good approach, although if a caller just calls /v1/data/authz, then the message wouldn't show in the decision log's result, right? I think we're trying to stick with only using the /v1/data API if possible, since the query one requires a bit more knowledge of the inner-workings of OPA (quick context: we'll be rolling this out to a ton of a teams and want to give them the simplest interface possible). We will probably also have a required schema for all our authz responses also, something like:

{
  "allow": true,
  "obligation": {
    "code": 123
   },
  "reasons": [
    "identity is an employee",
    "identity has 2FA"
  ]
}

The tricky part I guess will just be propagating all those "reasons" through a bunch of nested rules. Thoughts?

The tricky part I guess will just be propagating all those "reasons" through a bunch of nested rules. Thoughts?

That type of structured output plus limiting the query does seem like the best approach to take. While I agree that it could be tricky to fill in the reasons (depending on the policies) there are at least a few OPA users that have been successful with that type of approach. There is a lot of prior art to build on for the deny[msg] style rules as most of the kubernetes admission controller integrations use them, along with how conftest structures policies.

The part I'm struggling with in regard to the authz/allow/deny pattern is this:

If all of the allow[msg] partial rules fail, I need a way to capture those failures and their messages in the result, essentially explaining to the caller WHY they aren't authorized, in a human-friendly message. Right now, if any of the rules within an allow[msg] fail, then it doesn't become part of the final allow set, and as far as I can tell, there's no way for me to access the msgs at that point. And then I also need to be able to capture the allow[msg]'s in nested scenarios and propagate them up. Consider this contrived example...

package rock
rock := { "sparkles"}
package shiny
import rock

default yes = false
yes {
  count(reason) > 0
}

reason[msg] {
  rock["sparkles"]
  msg := "the rock sparkles"
}
reason[msg] {
  rock["shimmers"]
  msg := "the rock shimmers"
}
package rare
import rock

default yes = false
yes {
  count(reason) > 0
}

reason[msg] {
  rock["hard_to_find"]
  msg := "the rock is hard to find"
}
reason[msg] {
  rock["expensive_to_mine"]
  msg := "the rock is expensive to mine"
}
package valuable
import shiny
import rare

default yes = false
yes {
  count(reason) > 0
}

reason[msg] {
  shiny.yes
  rare.yes
  msg := shiny.reason | rare.reason
}

In the above example, the rock is shiny but not rare, and thus not valuable. I want the caller to know why the rock isn't valuable. If I call /v1/data/valuable, I want a response that looks roughly like this (the reason could be a different data type, a nested array or something):

{
  "result": {
    "yes": false,
    "reason": "the rock is not valuable because the rock is not rare, because it is neither hard to find nor expensive to mine"
  }
}

Is there a way to do this?

Today, we can't generate an explanation for why access is allowed or denied without it being encoded into the policy. In the example above, the policy provides a reason for why access is allowed -- if you wanted to provide reasons why access is denied then you would have to encode those into the policy as well.

Maybe something like this would work?

decision := {"allowed": allowed, "reason": reason}

default allowed = false

allowed {
   count(allow) > 0
   count(deny) == 0
}

reason := allow { allowed }
reason := deny { not allowed }

allow[msg] { ... }   # reason to allow #1
allow[msg] { ... }   # reason to allow #2
...

deny[msg] { ... }    # reason to deny #1
deny[msg] { ... }    # reason to deny #2
...

You would then query for decision.

Typically we don't see people providing reasons why access is granted--they provide reasons why access was not granted. E.g.,

decision := {"allowed": allow, "reason": reason}

default allow = true

allow { conditions_1 }
allow { conditions_2 }
...

reason["conditions_1 not met"] {
   not conditions_1
}

reason["conditions_2 not met"] {
   not conditions_2
}

It sounds like this would handle #2897.

stale commented

This issue has been automatically marked as inactive because it has not had any activity in the last 30 days.