open-policy-agent/opa

Add the ability to create decision labels within a policy

Opened this issue · 28 comments

BLUF: We are looking to add the ability to attach labels to a decision log from within policy, so that we can filter and audit those decisions, report them, and ultimately analyze them for malicious access attempts. This label data must be separate from the decision "result" data.

The project I am working on has many distributed services and we are moving toward utilizing OPA for them. We have implemented some common policies to be used as a bundle for most services. We want each development team to control how they write their own policies and their JSON contract between their applications and OPA instances. However, we need to grant them the ability to run common policies and report when specific access attempts are made.

A majority of our applications deal with data about specific people. Our organization has complex rules around read access to specific people, or "participants", which need to be accessible to a plethora of applications. Based on certain attributes, a user may or may not have read access to a participant. Similar to GitHub returning a 404 when you don't have access to view a repository, our applications cannot confirm that a resource exists when the user does not have read access to the participant. Functionally, this comes down to several convenience functions such as data.common.has_read_access_to_participant_by_id(user_id, participant_id). This function could return an object representing the permit/deny decision and the additional data needed to report the access, but we cannot guarantee that the consuming policy will put that data on its policy response. If the data were on the policy response, it could be a security concern. Since reporting these access attempts is a mandated requirement, we have the following options.

  1. Force consumers to call the common policy via an HTTP call. Either their application can asynchronously call both their standard RBAC policy and the common policy, or their standard policy can call the common policy via http.send. The former is better performance-wise but wouldn't work for something like an Envoy proxy ext_authz filter, which will only call one policy. The latter has been tested as a system calling http.send to the policy on localhost and has shown poor performance under load.
  2. Create a built-in function that makes an asynchronous HTTP call to report the access attempt. This would be used by the common policy. We would need to configure retry behavior like the decision log API does.
  3. Create a built-in function that adds a map of data to the decision context, then output that data with the decision log API in a new field, decision_labels or similar. Our decision log consumers (we push logs to Kafka) can then analyze the label data and filter a topic to the decisions that need to be reported.

We believe that option 3 would provide many OPA users with new functionality that could greatly improve their workflows. This label functionality is very similar to the nd_builtin_cache attribute; however, we MUST have the label data in the decision log API, whereas the nd_builtin_cache data can be stripped if the content becomes too big. It is also similar to the existing labels attribute on the decision log, which provides server data but does not allow adding labels for the specific decision context.

This topic has previously been discussed on the OPA Slack here:
https://openpolicyagent.slack.com/archives/CBR63TK2A/p1692975390194469

A developer on my team is beginning to develop this functionality today and will put in a pull request when the work is done. We're hoping that you will approve and add the functionality to OPA.

Hello @david-hamilton-bah thanks for the detailed issue and also for providing info on your use-case. From what I understand the app developers will author some policies and use the common helper packages y'all provide to come up with the final decision. Is this correct?

Create a built-in function that adds a map of data to the decision context, then output that data with the decision log API in a new field, decision_labels or similar.

What's not clear to me is why these extra "labels" that you need for filtering purposes cannot be part of the policy decision (i.e. the result field). I would have imagined the policy decision is an object that would contain all the necessary context, which the decision log server could use for filtering, analytics, etc. It seems very specific to your use-case that label data should be separate from the result. Something to consider would be to define a policy decision format which the app developers need to adhere to; this way you know how the policy decision will look. I don't know enough about your use-case to tell whether this is feasible, but happy to expand if you provide more details. Thanks!

From what I understand the app developers will author some policies and use the common helper packages y'all provide to come up with the final decision. Is this correct?

Correct.

What's not clear to me is why these extra "labels" that you need for filtering purposes cannot be part of the policy decision (i.e. the result field). ... Something to consider would be to define a policy decision format which the app developers need to adhere to; this way you know how the policy decision will look.

Defining a standard policy decision format is something I am attempting to avoid. This is a very large project with many teams distributed across the country. If we had to make changes to that standard format, we would need each team to apply any new changes. Many will be using Istio Envoy proxies that expect a simple true/false value in the response. Even with a mandated response format, we could open the door to misuse and missed reporting in these cases. Putting the data in the response also goes against Zero Trust principles, in that it sends the consuming application additional data that it does not need. That data may end up being logged by the consuming application, potentially creating a leak.

You could use decision log masking to add data to the result (or the logged input if you so prefer) without it being part of the actual result returned to clients.

package system.log

import rego.v1

mask contains {"op": "upsert", "path": "/input/result/labels", "value": labels} if {
    # some logic to determine what labels to apply here
    labels := #...
}

I think that would be preferable over adding a new built-in to OPA which was essentially a no-op in any context but opa run --server.

@anderseknert
That's an interesting thought, we are already using log masking to hide PII. However, how would I get the data from my common policy into the decision context in order to add it to the result in the log masking rego?

You can import and evaluate any rules you normally would from your decision log policy. I guess it could be a problem if something was really expensive to calculate in the first place, and you now would have to do it once for the client and another time in the masking policy (possibly with some mocked input attribute indicating that you'd want extra data returned in the result). The masking policy runs asynchronously, so the user isn't kept waiting, but it obviously adds some load to the OPA instance nonetheless. I could think of a few ways that could be improved if necessary, but it's a bit early to expand on that before knowing if/where it is a problem in the first place :)

Adding built-in functions that have a hard coupling to the way OPA is run (like various logging functions suggested in the past), and that wouldn't work with opa eval, opa test, etc... is something at least I'd like to avoid if possible.

I can understand not wanting to introduce built-ins that have no purpose for the other methods of running OPA. Could this context-labelling functionality also serve a purpose by being logged with --explain, similar to or enhancing the trace functionality?

For opa test usage, I know I would like to see the data added to the context within the trace.

The system.log policies could/should be unit tested as other code. Something like rule level tracing would indeed be a nice addition to OPA, and probably necessary if it's meant to be always on.

Sorry, but I'm still not seeing how the desired behavior could be handled in the log masking. Say our tenant application has an endpoint GET /api/widgets/123 and uses an Envoy proxy ext_authz filter. In their policy, they first check user roles (RBAC), then use the data to determine to whom this widget belongs; let's say participant ID 444. Then they call data.common.has_read_access_to_participant_by_id(user_id, participant_id) and it returns false. The policy evaluates to false and the proxy returns 401.

Now, asynchronously in the log policy, we have the Envoy input and the response output. Neither of those provides us, in the common functionality, the ability to re-call has_read_access with the participant ID. We would have to translate the widget ID to the participant ID, and that would require our common policy to have domain knowledge of widgets.

Now, asynchronously in the log policy, we have the Envoy input and the response output. Neither of those provides us, in the common functionality, the ability to re-call has_read_access with the participant ID.

In the first policy eval, how was the participant_id determined? You mentioned some RBAC roles had to be looked up to get the participant_id. One option is to make participant_id part of the result and have that available in the masking policy.

Alternatively, how are the RBAC roles provided? Are they part of an OPA bundle? If yes, the masking policy will have access to the OPA store, which means it can access the RBAC data as well to look up participant_id.

Another option is to make a http.send call from the masking policy but you've found this to not be performant iirc.

Since you mentioned envoy proxy ext_authz filter, I imagine you're using the OPA-Envoy plugin. There are ways to inject data in the result object using that plugin but we can explore that later.

I probably shouldn't have mentioned the RBAC; that was just an example of what the tenant application may do. They may do some RBAC, then they have to determine read access to this particular resource. The data in OPA will have the mapping of widget to participant, but that is tenant data. Where the data comes from is irrelevant; the point is that it belongs to the tenant domain, and the common policy has no concept of a widget, only a participant. I understand the possibility of enforcing that the tenant put the participant ID in the response, but again that puts more onus on the tenant to ensure they are compliant and creates the chance of this mandated reporting being missed. It also comes back to forcing the policy to be re-run in order to add the data to the decision log. This all seems very hacky compared to being able to label the decision context.

To me this is a very natural functionality. I would compare it to the Mapped Diagnostic Context (MDC) used in Log4j, which allows you to set data on the context that is then used for logging. See here for example. Our legacy applications using Sun XACML have access to the EvaluationCtx object and can store data there that is later processed after the evaluation is complete. Does the opa.runtime() or any other method accessible from policies have a way to access the context within a policy?

It also comes back to forcing the policy to be re-run in order to add the data to the decision log.

If the participant id is part of the result then there is no need to re-run the policy to add this in. You either do it as part of the tenant policy or in the masking policy. The former, as you said, puts the onus on the tenant. The latter is controlled by the OPA admin, who may be separate from the policy author. So this approach allows you to customize the additional data that you want to insert into the log after policy evaluation. Also, since the masking policy uses the same in-memory store as the tenant, I would imagine that could be leveraged as well.

Does the opa.runtime() or any other method accessible from policies have a way to access the context within a policy?

Can you elaborate a bit on what you mean by "context within a policy"?

If the participant id is part of the result then there is no need to re-run the policy to add this in.

The participant ID is not the data; it is the minimum needed to evaluate whether the user has access. All of the policies would have to be re-evaluated to determine whether the user can access the participant, and only then could that data be put in the log. Making up a sample where users can't view participants belonging to another organization, the data that might be logged would be:

{
  "user": {
    "name": "Bob",
    "organization": "ABC",
    ...
  },
  "participant": {
    "name": "Alice",
    "organization": "DEF",
    ...
  },
  "action": "View Participant Widgets",
  "decision": "Denied",
  "textReason": "Users belong to different organizations",
  ...
}

This is determined by a policy that would have had to run before the decision, and again in order to log the data.

Can you elaborate a bit on what you mean by "context within a policy"?

The context in which the policy has run: an object that holds information like the request, the response (which obviously would not exist yet), the NDBuiltinCache, trace IDs, etc.

Thanks for providing the above sample. In your original issue description, your plan was to create a built-in function that adds a map of data to the decision context. Now let's say we have that; in your use-case, would this built-in be invoked in the tenant policy?

No, the common policy would invoke the built-in to add the data.

My current project also has similar requirements: we want our logs to include the reason for the decision, using a common component, without having it added to the result of all (many) rules.
I don't really see why not, but I'm biased 😄
If the --server thing is a blocker, then maybe also add it to the output for use with --explain, or kind of what print does?


I'm not sure the notion of a builtin that's used only for its side effects is a good one in Rego. We've explicitly stated in the docs that custom builtins shouldn't do that, and that http.send should not be used to trigger side effects in other systems. And now we'd introduce a builtin for just that, a side effect? 🤔

The result of that would make for a difficult-to-use feature, I'm afraid. People would need to know enough about the Rego evaluation semantics to be able to use it correctly; and even so, future optimizations could break things by no longer calling the builtin.

Are there any other things we could consider? Some clever scheme using rule annotations, perhaps?

I have a similar need as well.

For example, I want to add integration with a feature flag system (say unleash) directly in policies as a custom builtin. And it would be great to have the feature flag result in decision logs.

In this case, if the builtin implementation had access to append to this context directly in Go, that would be enough.

Maybe some KV pair data structure that gets propagated from the builtin context to the context that goes to the decision log plug-in?

This way we don't add a new default builtin that bloats the Rego API; instead, it is handled internally by custom builtin implementations.

Also, it might be easier than handling custom metadata.

Reading @david-hamilton-bah's request, it seems that the custom function handling participant data may already call a custom builtin, or at least could be solved this way.

In this case, if the builtin implementation had access to append to this context directly in Go, that would be enough.

🤔 So if the evaluator (topdown) passed the context along, giving the builtins a chance to append key/val pairs via the usual context.WithValue machinery, then a custom decision log plugin could extract and process them. @xico42 is that what you had in mind?

So this would become less of an enhancement request and more a smaller-sized internal thing that would enable power-users to do stuff they hadn't (easily) been able to do before.

Exactly @srenatus. Although I was thinking about something coded into the default decision log, if this is possible with a custom one, I'd be happy either way.

Just to make myself a bit clearer, this is what I had in mind:

// Package logs implements decision log buffering and uploading.
package logs

import (
	"context"
	"sync"
)

type ctxKey struct{}

type KV struct {
	mu sync.RWMutex
	m  map[string]interface{}
}

func (kv *KV) Get(key string) (interface{}, bool) {
	kv.mu.RLock()
	defer kv.mu.RUnlock()
	val, ok := kv.m[key]
	return val, ok
}

func (kv *KV) Set(key string, val interface{}) {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	kv.m[key] = val
}

// WithContext attaches an empty, initialized KV to the context.
func WithContext(ctx context.Context) context.Context {
	return context.WithValue(ctx, ctxKey{}, &KV{m: map[string]interface{}{}})
}

// Set is a no-op when no KV is present, e.g. under `opa eval` or `opa test`.
func Set(ctx context.Context, key string, val interface{}) {
	kv, ok := ctx.Value(ctxKey{}).(*KV)
	if !ok {
		return
	}

	kv.Set(key, val)
}

Then, right before policy evaluation, the context would be populated with the KV data structure:

ctx := logs.WithContext(ctx)

And in custom builtin implementations, one could do the following:

// implements topdown.BuiltinFunc
func builtinRepeat(bctx BuiltinContext, operands []*ast.Term, iter func(*ast.Term) error) error {
    // custom builtin logic: parse the string and count operands...
    str := string(operands[0].Value.(ast.String))
    count, _ := operands[1].Value.(ast.Number).Int()
    logs.Set(bctx.Context, "someKeyThatGetsAppendedToDecisionLogs", "some random value that makes sense to me")
    return iter(ast.StringTerm(strings.Repeat(str, count)))
}

Later on, the KV map would be extracted and rendered into an extra key in the decision logs.

Ah, so this would allow us to side-step the problem that ctx isn't forwarded through iter? I.e. we don't have to do the more idiomatic

ctx = logs.WithKV(ctx, "key", "val")
return iter(ctx, ...)

...which would (probably) require deeper changes in the eval code.

Does it make sense? I could provide a PR with such an implementation, as it seems to have low impact on the rest of the code.

Oh I think it makes sense, thank you!

The overall design question kind of remains, though -- how do you provide the extra keys? With what you're proposing, it would be up to the user to create the builtin (simple enough) and to use it properly. And that would be the bar -- you'd need to be able to come up with (or find) the code, and to use it properly. As opposed to providing the simple builtin as part of mainline OPA, with the same caveats. So, I feel like this is still a bit of a well-designed pitfall, only one small step further away 😅

Even if a simple builtin as part of mainline OPA is desired, it could benefit from this solution anyway.

So it seems that it enables both approaches.

Alternative Approaches considered (as requested by @srenatus):

  1. Using Regal to enforce, via linter policy, that returned data includes the decision result
  2. Calling http.send a second time specifically to add the desired information to the context so it will be present in the Decision Log
  3. Embedding OPA directly into consumer services, allowing consumers to add the data directly into the context via local policies

In the cases of Alternatives 1 and 3, this forces the responsibility for including the data onto the consumer, which is not the intent of our use case. In the case of Alternative 2, this was our original solution, but when implemented in a test environment, the latency quickly became unacceptable, as multiple service calls to OPA would "stack" the queries.

  1. Calling http.send a second time specifically to add the desired information to the context so it will be present in the Decision Log

I'm afraid I don't understand this one. Do you call OPA via http.send from another OPA instance? 🤔 Or from the same OPA instance?

Same instance, although it could also be a different instance if we chose to go that route. Basically, in the function data.common.has_read_access_to_participant_by_id(user_id, participant_id), we check whether this REST service is enabled. If disabled, we perform the check via normal Rego; if enabled, we POST a request to the common policy, for example v1/data/common/participant/read. This server endpoint is a property that could be set to localhost or to a separate OPA instance.

This ends up creating a new request, so we get a separate decision log with the result we want in it, and the output affects the result of the original request. It makes the decision logs easy to find, because they all have the same request path. However, the downside is performance. On a lightly-used instance there is minimal lag, but a heavily-used instance ends up with large delays due to stopping mid-request and putting a new request "in the queue" that it must wait for.