Should not attempt to modify resources in normal namespace
Closed this issue ยท 27 comments
Describe the bug
We are running a proof-of-concept in one of our OpenShift clusters. And while the controller runs find without any watches, it fails badly when we enable watches of roles and role bindings. This issue might be relevant for other resource types, but namespace RBAC is the obvious start for propagating namespace resources for us.
It seems like the controller attempts to annotate resources in namespaces that are not annotated with any accurate.cybozu.com
annotations, and this must be wrong! Errors are emitted constantly, and here is an example:
{"level":"error","ts":"2023-09-25T18:21:37Z","msg":"failed to check the controller reference","controller":"role","controllerGroup":"rbac.authorization.k8s.io","controllerKind":"Role","Role":{"name":"cluster-samples-operator","namespace":"openshift-cluster-samples-operator"},"namespace":"openshift-cluster-samples-operator","name":"cluster-samples-operator","reconcileID":"957d956a-b15e-4463-a458-ca5b1a32582b","error":"failed to add accurate.cybozu.com/propagate-generated annotation: roles.rbac.authorization.k8s.io \"cluster-samples-operator\" is forbidden: user \"system:serviceaccount:accurate:accurate-controller-manager\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:accurate\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held"
This is a total blocker for us. The controller must NOT touch resources in namespaces that are irrelevant for Accurate.
Environments
- Version: 1.1.0
- OS: N/A
To Reproduce
- Install Accurate
- Configure Accurate to watch roles and role bindings
- Observe that Accurate adds (or attempts to add) annotations to resources in namespaces that are not observed/managed by Accurate. On Openshift this update is unsuccessful and the controller logs become extremely chatty.
Expected behavior
No attempt to update irrelevant namespace resources and no errors in controller logs.
Additional context
The controller errors might only occur on Openshift, which is a secure by default Kubernetes distribution. But the root cause here is the wrong attempt to change irrelevant resources.
@erikgb
Thanks for your feedback.
We will investigate it.
This is fixable but not easy.
The accurate controller tries to add that annotation to record if the resource is generated from another resource, mainly for performance reasons.
If we skip this check, we have to check all watched resources when an Accurate label is added to a namespace.
With Accurate, namespaces are created dynamically by users, so statically granting permission only for a set of namespaces to the accurate controller wouldn't make sense, I guess. If we grant permissions for all namespaces, this issue is no longer a problem.
I want to mark this as by design.
I want to mark this as by design.
That would be very unfortunate and imply that Accurate is not something we can use in our clusters. And I would expect this to be problematic for a significant share of potential users of Accurate. Is this design mentioned somewhere in the docs?
By addressing, not to speak of modifying, resources in namespaces not marked as an Accurate-namespace, ref. https://cybozu-go.github.io/accurate/concepts.html#namespace-types, I think Accurate should be considered a holistic software product. Meaning that you cannot by any means migrate to/from Accurate. Your cluster must be bootstrapped with Accurate and deleted if you want to remove Accurate.
A temporary workaround could be introducing an operator config (regexp) for namespaces that Accurate should stay away from. But this would just be a minimum viable workaround and not a solution to the problem IMO.
I could be interested in participating in designing a fix for this, and performance is important. But not more important than doing the right thing. ๐
@erikgb
Let me ask a few questions.
- The accurate controller failed to modify some resources in your cluster, and errors were recorded. What was your problem with this behavior? Noisy error logs?
- When you give Accurate permissions only for limited namespaces, how will you give it permissions for new namespaces that your user creates via SubNamespace?
- The accurate controller failed to modify some resources in your cluster, and errors were recorded. What was your problem with this behavior? Noisy error logs?
Noisy error logs for sure. And since Accurate has RBAC in all namespaces I would have to check if the controller managed to add annotations to any resources of the watched kind. I can do that.
- When you give Accurate permissions only for limited namespaces, how will you give more permissions when a user creates a new SubNamespace?
I don't think we can/should limit the Accurate permissions to namespaces using RBAC. But IMO Accurate should never modify (or attempt to modify) resources in namespaces that are not annotated with an Accurate annotation. Could you please explain why this requirement is not possible to implement?
Thank you.
This problem can be fixed. It's just not very easy.
Interestingly, we find the same error in our environments even though our Role resources are in Accurate managed namespaces and the accurate controller has the admin privilege for Roles.
This is caused because that Role has more privileges than ClusterRole admin
.
Kubernetes API server has an additional check when editing Roles to prevent privilege escalation, so Roles are very special.
https://kubernetes.io/docs/reference/access-authn-authz/rbac/#restrictions-on-role-creation-or-update
Nevertheless, our Accurate has been working fine for us because the Roles that caused error logs were actually not intended to be propagated (if so, we'd have given necessary privileges).
Based on these, we are open to accepting a fix for this, but for us, this is not a high priority.
Based on these, we are open to accepting a fix for this, but for us, this is not a high priority.
@ymmt2005 Thanks for your feedback on this. I will discuss with my team to eventually make a contribution in this area. I would expect you to prefer a design proposal before this change is implemented in a PR. Do you have any guidelines on the structure of such designs in this project?
Not in this project, but we usually write a design doc when we work on something new.
Examples are in https://github.com/cybozu-go/moco/tree/main/docs/designdoc
You may do the same, or since this is just fixing a bug, you can also use this issue to
explain your idea for the fix.
To make the issue a bit more precise, I suggest the following as a starting point for restricting the scope in which Accurate acts:
If neither a namespace nor any resources within it has an Accurate annotation, and no other Accurate annotations reference the namespace, Accurate should not act within the namespace.
Then I wonder:
- Is propagate-generated currently the onle feature causing Accurate to act outside of the scope stated above?
A very simple and temporary solution to the problem would then be allowing the feature to be disabed entirely.
- Would only annotating dependents with generate if their owner has the propagate-generated annotation restrict Accurate to the scope stated above?
To not have to check owners for all watched resources when Accurate is added to a namespace, cound the owners instead be watched, and their dependants be annotated generate whenever they are annotated propagate-generated? Similarly to configuring watched resource types, one would also configure which resource types can have propagate-generated. The
already existing watched list could be used, or a separate list could be added to the config.
Is propagate-generated currently the onle feature causing Accurate to act outside of the scope stated above?
Yes, as far as I remember right now.
Would only annotating dependents with generate if their owner has the propagate-generated annotation restrict Accurate to the scope stated above?
I don't understand this question very well.
Could this line be removed:
accurate/controllers/propagate.go
Line 340 in 6cd77bf
So that watched resources are only annotated with generate if they have an owner with the propagate-generated annotation.
That annotation is added for better performance.
Since a resource is generated only once, we want to check the owner only once.
Without this annotation, Accurate would have to check the same resource many times.
I agree that the performance benefit is substantial! Could we still get around it somehow?
What if it was required to specify which resource types may be an owner (and have the propagate-generated annotaion)? So that owners are only looked up if they are of correct resource type? Could even configure owner resource types per resource type, to further limit lookups. E.g. {owners: {Secret: [Certificate, ExternalSecret]}}
, meaning both Certificate
and ExternalSecret
can generate Secret
.
What if it was required to specify which resource types may be an owner (and have the propagate-generated annotaion)?
It's doable, but I don't know if it helps mitigate this reported issue.
In combination with removing the mentioned line, it could be a tradeoff between performance and resolving the issue.
Otherwise, I think we need some outside the box thinking to get this one resolved.
What if that line is removed by default. And a feature gate (or flag) enables better performance, with the tradeoff of annotating additional resources in other namespaces?
What if that line is removed by default. And a feature gate (or flag) enables better performance, with the tradeoff of annotating additional resources in other namespaces?
I prefer a fix to skip that line if the namespace does not have labels for Accurate.
In that case, we need to reconcile all watched resources and check accurate.cybozu.com/propagate-generated
annotation of their parents when an accurate label is added to a namespace.
So @zoetrope and I agreed that we could simply disable this feature by default or even remove it.
What do you think?
@ymmt2005, this is really good news! ๐ I think the end goal should be to remove the feature completely, as it will remove some critical code and make the Accurate operator even simpler and more understandable. It will definitely also simplify/improve RBAC and controller machinery. If I am not mistaken, controller-runtime will by default spin up an informer/cache (if it not already exist) for any client Get/List call. So the lookup of any owner resource can be more costly than anticipated.
How to move forward is really up to you. Our short-term goal would be to have a release of Accurate that allows us to run without this really problematic issue present. We are not prepared to promote Accurate beyond our POC/sandbox cluster without it. So as a start, I would suggest merging #92 and then cutting a release. How does that sound to you?
@ymmt2005 @zoetrope The latest release (with the new feature gate enabled) seems to work very well in our POC cluster, which means we are ready to move forward. ๐ To resolve this issue, I suggest to:
- Deprecate the propagate-generate feature (mainly doc update)
- Promote the
DisablePropagateGenerated
to Beta and enable it by default.
WDYT? If you are willing to accept PRs with these suggested changes, we should be able to remove the feature and clean up the code after a couple of releases.
I'm fine with your suggestions.