open-policy-agent/opa

OPA process OOM when a bundle contains a large file despite having size_limit_bytes

Closed · 7 comments

OPA version: v0.60.0

Short description

When an OPA bundle server serves a tarball containing a large arbitrary file (for the purposes of this example, a 5GB file), OPA seems to attempt to load the bundle into memory several times, despite the config setting size_limit_bytes to 1000000. This ends up crashing the server with OOM errors.

When OPA loads the bundle, it recognizes that there's a file in the tarball that exceeds the 1000000-byte limit:

{"level":"error","msg":"Bundle load failed: bundle file 'example.rego' exceeded max size (1000000 bytes)","name":"authz","plugin":"bundle","time":"2024-01-06T23:27:57-05:00"}

However, RAM usage started climbing immediately (spiking to 3-5 times the size of the downloaded tarball). I noticed that the download retry mechanism retries loading it several times, resulting in high RAM consumption almost immediately (https://github.com/open-policy-agent/opa/blob/main/download/download.go#L210-L252 is what I could pinpoint this to; I could be wrong).

In essence and in practice, loading a 5GB bundle caused a server with 24GB of RAM to hit OOM errors, resulting in a sort of denial of service.

Steps To Reproduce

OPA config I used:

services:
  - name: acmecorp
    url: http://192.168.2.243:8080

bundles:
  authz:
    service: acmecorp
    resource: bundle.tar.gz
    persist: false
    size_limit_bytes: 1000000

The tarball is generated and served this way:

dd if=/dev/zero bs=1M count=5000 of=example.rego
tar czvf bundle.tar.gz example.rego
python3 -m http.server 8080

Expected behavior

OPA rejecting the bundle and releasing memory when a file exceeds the configured size_limit_bytes value

The OPA server log in debug mode:

c-addrs":[],"level":"info","msg":"Initializing server. OPA is running on a public (0.0.0.0) network interface. Unless you intend to expose OPA outside of the host, binding to the localhost interface (--addr localhost:8181) is recommended. See https://www.openpolicyagent.org/docs/latest/security/#interface-binding","time":"2024-01-06T23:27:50-05:00"}
{"level":"debug","msg":"maxprocs: Leaving GOMAXPROCS=1: CPU quota undefined","time":"2024-01-06T23:27:50-05:00"}
{"level":"info","msg":"Starting bundle loader.","name":"authz","plugin":"bundle","time":"2024-01-06T23:27:50-05:00"}
{"level":"debug","msg":"Download starting.","time":"2024-01-06T23:27:50-05:00"}
{"headers":{"Prefer":["modes=snapshot,delta"],"User-Agent":["Open Policy Agent/0.60.0 (linux, amd64)"]},"level":"debug","method":"GET","msg":"Sending request.","time":"2024-01-06T23:27:50-05:00","url":"http://192.168.2.243:8080/bundle.tar.gz"}
{"headers":{"Content-Type":["application/json"],"User-Agent":["Open Policy Agent/0.60.0 (linux, amd64)"]},"level":"debug","method":"POST","msg":"Sending request.","time":"2024-01-06T23:27:50-05:00","url":"https://telemetry.openpolicyagent.org/v1/version"}
{"level":"debug","msg":"Server initialized.","time":"2024-01-06T23:27:50-05:00"}
{"headers":{"Content-Length":["5096530"],"Content-Type":["application/gzip"],"Date":["Sun, 07 Jan 2024 04:27:52 GMT"],"Last-Modified":["Sun, 07 Jan 2024 04:26:42 GMT"],"Server":["SimpleHTTP/0.6 Python/3.10.12"]},"level":"debug","method":"GET","msg":"Received response.","status":"200 OK","time":"2024-01-06T23:27:50-05:00","url":"http://192.168.2.243:8080/bundle.tar.gz"}
{"level":"debug","msg":"Download in progress.","time":"2024-01-06T23:27:50-05:00"}
{"headers":{"Content-Length":["216"],"Content-Type":["application/json"],"Date":["Sun, 07 Jan 2024 04:27:54 GMT"]},"level":"debug","method":"POST","msg":"Received response.","status":"200 OK","time":"2024-01-06T23:27:52-05:00","url":"https://telemetry.openpolicyagent.org/v1/version"}
{"current_version":"0.60.0","level":"debug","msg":"OPA is up to date.","time":"2024-01-06T23:27:53-05:00"}
{"level":"error","msg":"Bundle load failed: bundle file 'example.rego' exceeded max size (1000000 bytes)","name":"authz","plugin":"bundle","time":"2024-01-06T23:27:57-05:00"}
{"level":"debug","msg":"Waiting 0s before next download/retry.","time":"2024-01-06T23:27:57-05:00"}
{"level":"debug","msg":"Download starting.","time":"2024-01-06T23:27:57-05:00"}
{"headers":{"Prefer":["modes=snapshot,delta"],"User-Agent":["Open Policy Agent/0.60.0 (linux, amd64)"]},"level":"debug","method":"GET","msg":"Sending request.","time":"2024-01-06T23:27:57-05:00","url":"http://192.168.2.243:8080/bundle.tar.gz"}
{"headers":{"Content-Length":["5096530"],"Content-Type":["application/gzip"],"Date":["Sun, 07 Jan 2024 04:27:59 GMT"],"Last-Modified":["Sun, 07 Jan 2024 04:26:42 GMT"],"Server":["SimpleHTTP/0.6 Python/3.10.12"]},"level":"debug","method":"GET","msg":"Received response.","status":"200 OK","time":"2024-01-06T23:27:57-05:00","url":"http://192.168.2.243:8080/bundle.tar.gz"}
{"level":"debug","msg":"Download in progress.","time":"2024-01-06T23:27:57-05:00"}
Killed

OPA seems to be doing the right thing here, i.e. rejecting the large file and returning a bundle load error, which in turn should kick off successive downloads with exponential backoff. What's probably happening is that the memory is not getting released quickly enough? I guess if GC were to run we'd see the memory being released. Go 1.19 introduced the concept of a soft memory limit, which helps control GC behavior and can be set via the GOMEMLIMIT environment variable. I haven't played around with this, so I'm not sure if it will help, but is this something you've looked into?
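For reference, the same soft limit can also be set programmatically via runtime/debug.SetMemoryLimit (available since Go 1.19); a minimal sketch, where the 2GiB value is just an illustrative choice:

package main

import (
    "fmt"
    "runtime/debug"
)

func main() {
    // Equivalent to running with GOMEMLIMIT=2GiB: the GC runs more
    // aggressively as the heap approaches the limit. It is a soft limit,
    // so a single multi-gigabyte allocation can still OOM the process.
    prev := debug.SetMemoryLimit(2 << 30) // 2 GiB
    fmt.Printf("previous soft memory limit: %d bytes\n", prev)
}

From the shell, the equivalent would be something like GOMEMLIMIT=2GiB opa run --server. Since it only steers the GC rather than capping allocations, it may not prevent this particular OOM.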

@ashutosh-narkar do we really need to load the file at all if its size exceeds size_limit_bytes? That seems to defeat the point of the setting. I haven't looked into it, but I'm assuming there's a way to check the size of an item in the tarball before reading its bytes. Is that not the case?
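At least with the standard library's archive/tar, the entry header carries the size, so something like this seems possible in principle (a rough sketch; sizeLimit is a hypothetical stand-in for size_limit_bytes, not OPA's actual loader code):

package main

import (
    "archive/tar"
    "compress/gzip"
    "fmt"
    "io"
    "os"
)

const sizeLimit = 1000000 // hypothetical stand-in for size_limit_bytes

func main() {
    f, err := os.Open("bundle.tar.gz")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    gz, err := gzip.NewReader(f)
    if err != nil {
        panic(err)
    }
    tr := tar.NewReader(gz)

    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            panic(err)
        }
        // hdr.Size comes from the tar header, so an oversized entry can be
        // rejected before any of its contents are copied into memory.
        if hdr.Size > sizeLimit {
            panic(fmt.Sprintf("bundle file %q exceeded max size (%d bytes)", hdr.Name, sizeLimit))
        }
        data, err := io.ReadAll(tr) // safe: size already validated
        if err != nil {
            panic(err)
        }
        fmt.Printf("read %s (%d bytes)\n", hdr.Name, len(data))
    }
}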

Looking at the implementation now, and maybe I'm missing something, but it seems like the NextFile() function greedily reads all files on the first invocation, then serves them from a cache on subsequent calls as long as there are more to return. We then call readFile() on each of the files (which we've already read), which copies them into yet another buffer. Here we use Read with a limit, which fails if the size limit is exceeded. But at this point we've already read the entire file once, and now we read it again up to the size limit. So if we've set a size limit of 1 GB and the tarball contains a 5 GB file, we'll have 6 GB of data buffered before we return an error. And in case we don't hit the size limit, we'll effectively have each file buffered twice, spending twice as much memory as we need to.
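To make the double-buffering concrete, here's a runnable toy of the pattern described above (deliberately tiny sizes; an illustration of the pattern, not OPA's actual code):

package main

import (
    "bytes"
    "fmt"
    "io"
    "strings"
)

const sizeLimit = 16 // tiny limit so the toy runs instantly

func main() {
    // entry stands in for one file streamed out of the tarball.
    entry := strings.NewReader("this payload is longer than the limit")

    // First copy: the NextFile-style eager read buffers the whole entry,
    // regardless of any size limit.
    var first bytes.Buffer
    if _, err := io.Copy(&first, entry); err != nil {
        panic(err)
    }

    // Second copy: the readFile-style limited read into another buffer.
    // The limit only trips here, after the data is already buffered once.
    var second bytes.Buffer
    n, err := io.Copy(&second, io.LimitReader(&first, sizeLimit+1))
    if err != nil {
        panic(err)
    }
    if n > sizeLimit {
        fmt.Println("size limit exceeded, but both buffers are already resident in memory")
    }
}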

Some alternative approaches I can think of:

  1. Have NextFile read lazily, i.e. only one file from the tarball per invocation, and return the reader without copying the buffer (sketched after this list). This seems like a reasonable expectation for a function called "NextFile", but it's a breaking change in behavior, as callers would now be required to read the stream of the "file" returned before calling NextFile again.
  2. Keep the behavior of NextFile, but store the bytes.Buffer on the descriptor rather than an io.Reader, and make it accessible. That way we can reuse the buffer we've already read elsewhere, and avoid the second copy.
  3. Pass the size limit to the tarball loader, and have it fail immediately when the size reported in the tar header exceeds the limit, i.e. before we read anything more than the header.
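For option 1, a lazy variant could look roughly like this (nextFile here is a hypothetical sketch, not a proposed patch; plain tar for brevity):

package main

import (
    "archive/tar"
    "fmt"
    "io"
    "os"
)

// nextFile advances the tar reader by one entry and returns it without
// buffering. The caller must consume the returned reader before calling
// nextFile again, since both share the underlying stream.
func nextFile(tr *tar.Reader) (string, io.Reader, error) {
    hdr, err := tr.Next()
    if err != nil {
        return "", nil, err // io.EOF signals no more entries
    }
    return hdr.Name, tr, nil
}

func main() {
    f, err := os.Open("bundle.tar")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    tr := tar.NewReader(f)
    for {
        name, r, err := nextFile(tr)
        if err == io.EOF {
            break
        }
        if err != nil {
            panic(err)
        }
        n, err := io.Copy(io.Discard, r) // consume before the next call
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s: %d bytes\n", name, n)
    }
}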

2 and 3 aren't mutually exclusive; ideally we'd do both: avoid keeping two copies of each file, and don't read files that exceed the limit at all.

As for 2, perhaps there is some more elegant non-intrusive way to do it. I'm open to suggestions.

Thanks for investigating, Anders. Do we need a CVE assigned to this?

Anytime, Dolev 🙂 Quite pleased with the result!

As for a CVE, I'd lean towards no. OPA needs to run under the premise that the remote bundle server can be trusted. If that is not the case, an OOM is about the least harmful thing a malicious actor could accomplish. If they can tamper with the contents of bundles, they could e.g. change an authorization policy to grant themselves access, or, in the case of discovery bundles, change OPA's configuration to e.g. turn off decision logging or whatnot.

Exceeding the size limit in real-world deployments is most likely going to happen by accident, such as a user including a big file by mistake. That a mistake could cause an OOM is of course not good, and I'm happy to see this fixed. But similarly, there are many mistakes one might make as a bundle server "owner" that could have quite severe consequences, so I can't say I'm more worried about this than about other imaginable scenarios.

Happy to discuss more if you think differently!

Hi,

I agree that if a bundle server is compromised, DoS is the least interesting abuse case. But then again, bundle signing is also a thing, so there's some built-in assumption that things can get risky even when you supposedly trust your bundle server, no?

Totally up to the project to decide at the end of the day :) what's important is that it's fixed!

You're right — I didn't really make the distinction between whoever creates the bundle and whoever hosts it, and I should have! Indeed, bundle signing is a good extra measure. Assuming that is in place, the only actor who could accomplish this would be whoever built and signed the bundle, and if that's a malicious actor... not a whole lot we can do then.

Let's leave it as a (soon to be fixed!) issue for now. If others have opinions here, please make your voices heard :)

Thanks Dolev!