open-policy-agent/opa

Failover strategy to handle OPA bundle server issues

rudrakhp opened this issue · 7 comments

What is the underlying problem you're trying to solve?

For multiple reasons, we are serving OPA servers with policies from a custom bundle server, and not S3. Under such circumstances we need to be able to configure a failover strategy to improve availability of OPA.

Describe the ideal solution

Continuing the discussion from a previous issue, ideally OPA should be configurable with multiple bundle servers and a failover/load-balancing strategy that the user can implement as a plugin.
Example: OPA is configured with 5 bundle services. OPA queries a localhost endpoint /selectBundle(s), served by a plugin, to get one or a subset of the services as a response, and then downloads the bundle from the selected service as it does today. The selection strategy is implemented and managed entirely by the user, not by OPA. To maintain backward compatibility, OPA can skip this call if only one bundle is configured.

Describe a "Good Enough" solution

A default failover strategy could be to attempt downloading the bundle from each of the servers in turn until one of them succeeds.
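
To make the "try each server until one succeeds" idea concrete, here is a minimal Python sketch. It is not how OPA works today. The function names `fetch_http` and `download_with_failover` and the URLs in the usage note are hypothetical, chosen only for illustration.

```python
import urllib.request
from typing import Callable, Iterable, Tuple


def fetch_http(url: str, timeout: float = 5.0) -> bytes:
    """Download a bundle over HTTP(S). Raises OSError (e.g. URLError) on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_with_failover(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_http,
) -> Tuple[str, bytes]:
    """Try each bundle URL in order and return (url, data) for the first success."""
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:  # URLError, HTTPError and timeouts are OSError subclasses
            errors.append((url, str(exc)))
    # Every server failed (or none were configured): surface all errors at once.
    raise RuntimeError(f"all bundle servers failed: {errors}")
```

A caller would pass the configured servers in priority order, for example `download_with_failover(["https://bundles-a.example.com/authz.tar.gz", "https://bundles-b.example.com/authz.tar.gz"])`. The `fetch` parameter is injectable so the ordering logic can be exercised without a network.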

Additional Context

The specific strategy we are trying to implement is this: if the bundle server is down, OPA falls back to a policy bundle stored on disk. The OPA service runs as a container in a K8s-based environment. Not sure if it's possible to do so today.

The OPA config could be updated to support multiple services in the bundle config and use some load-balancing algorithm. I want to understand your specific use case a little better.

The specific strategy we are trying to implement is if the bundle server is down, it falls back to a policy bundle stored on disk.

Are you suggesting that a previous download was successful and OPA now has a bundle it uses? Or do you have a bundle on disk, and OPA first pulls from the remote server and uses the bundle on disk only if that fails? If it's the latter, how are you managing updates to the disk bundle? If you simply want OPA to read a bundle from disk when the remote is unavailable, you can use the persist option.

@ashutosh-narkar we already use the persist option, which helps if the OPA server is already running and a previous bundle download has succeeded. But if the OPA server comes up for the first time and the bundle server is down, there are no rules for it to run against.
If users could have a "default bundle" on disk that keeps updating itself in place as bundle downloads succeed going forward, we could avoid this scenario where OPA has no loaded bundle.
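
A rough Python sketch of the "default bundle" behaviour described above, purely to illustrate the idea. It is not an existing OPA feature, and the name `load_bundle` and its parameters are hypothetical. On a successful download the file on disk is replaced atomically; on failure the existing file is used.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def load_bundle(fetch: Callable[[], bytes], default_path: Path) -> bytes:
    """Return the remote bundle if reachable, else the default bundle on disk."""
    try:
        data = fetch()
    except OSError:
        # Remote is down: fall back to the seeded or previously saved bundle.
        return default_path.read_bytes()

    # Remote succeeded: refresh the default bundle in place. Writing to a temp
    # file in the same directory and renaming avoids a half-written bundle.
    fd, tmp_name = tempfile.mkstemp(dir=default_path.parent)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    os.replace(tmp_name, default_path)
    return data
```

The seed bundle would have to be shipped with the container image or mounted as a volume, which is an assumption about the deployment rather than something OPA manages.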

When OPA starts for the first time and bundle downloads are failing, OPA will not serve traffic if you set up proper health checks, and hence I would imagine the service that calls OPA will deny requests on a non-200 status response. Once OPA downloads the first bundle, it will get persisted, and that's a scenario you've already covered.
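
As a sketch of the fail-closed behaviour on the calling side, assuming OPA's health endpoint is queried with the `bundles` parameter so that it reports non-200 until bundles are activated. The helper names `health_status` and `authorize` and the base URL in the usage note are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable


def health_status(base_url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of OPA's health check, or 0 if OPA is unreachable."""
    url = f"{base_url}/health?bundles=true"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # non-2xx responses, e.g. bundles not yet activated
    except OSError:
        return 0  # connection refused, DNS failure, timeout


def authorize(status: int, decision: Callable[[], bool]) -> bool:
    """Deny unless OPA is healthy; only then evaluate the policy decision."""
    if status != 200:
        return False  # fail closed: no bundle loaded means no allow
    return decision()
```

A caller might use `authorize(health_status("http://localhost:8181"), query_opa)`, where `query_opa` is whatever function the service already uses to ask OPA for a decision. In Kubernetes the same health URL would more commonly back a readiness probe, so an OPA instance without bundles never receives traffic in the first place.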

I think that while it'd be possible to add some configuration to OPA to support loading bundles from multiple servers with a failover, this would be hard to do without making the configuration much more complex for what's already a non-trivial feature to configure.

I guess I'd be interested to hear a justification for this feature that shows why a highly available bundle server isn't sufficient in some cases. Most users load bundles from places like S3, DAS, an OCI registry, or some replicated custom service. All of these options are either highly available already or offer HA deployment options.

@charlieegan3 I guess the ask here was for OPA to support redundancy in bundle servers, especially if the bundle server implementation has multiple dependencies that don't guarantee HA. But we have been exploring ways to improve the bundle service's availability instead. I think this should not be an immediate requirement anymore. Thanks for all the relevant pointers though, @ashutosh-narkar!

Ok, thanks for clarifying @rudrakhp! I'll close this one for now and if there does turn out to be something you can't achieve with a more reliable bundle server, then we can re-open this one.