golang/go

syscall: caching RLIMIT_NOFILE is wrong implementation

ls-ggg opened this issue · 35 comments

Go version

go version go1.20.13 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.13"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1954436029=/tmp/go-build -gno-record-gcc-switches"

What did you do?

RLIMIT_NOFILE is cached in the go runtime, and RLIMIT_NOFILE in the cache is restored for the child process before exec.

Assume that the system's default RLIMIT_NOFILE is 1024, and process A generates child process B through fork. B is a go application, so the go runtime will cache RLIMIT_NOFILE as 1024. When I set B's RLIMIT_NOFILE to 10000 through prlimit in A, and use exec.Command to create process C in B, C's RLIMIT_NOFILE is restored to 1024.

This behavior is incompatible with the operating system. It also caused problems with runc:
opencontainers/runc#4195

The problem originates from f5eef58

What did you see happen?

Reference opencontainers/runc#4195

What did you expect to see?

RLIMIT_NOFILE should be correctly inherited to child processes

Please don't show us code in images. Link to go.googlesource.com or GitHub instead. Images are difficult to read. Thanks.

If I understand this correctly, the problem is that if process A uses prlimit to change the resource limit in process B, then the Go standard library isn't aware of that when process B starts process C.

We can't change the current behavior in which process B passes the original RLIMIT_NOFILE value on to process C. That is likely to break existing programs that do not expect a large RLIMIT_NOFILE, as discussed in #46279. That issues shows real failure cases when we didn't do that, so we can't change back.

So I think the question is whether there is a way that process B could detect that the process limit was changed by some other process via prlimit. In that case we should pass down the updated value, just as we already do if process B explicitly calls syscall.Setrlimit itself.

Please don't show us code in images. Link to go.googlesource.com or GitHub instead. Images are difficult to read. Thanks.
Ok, I've replaced the image with a commit link

I think there is also a get/set/set race situation here, please see opencontainers/runc#4266 and opencontainers/runc#4265 .


Go CL 393354 (in Go 1.19beta1) introduced raising the RLIMIT_NOFILE soft limit to the current value of the hard limit.

Further CLs (in Go 1.21, but backported to Go 1.20.x and 1.19.x) introduced more code in between getrlimit and setrlimit syscalls (see Go's src/syscall/rlimit.go).

In runc exec, we use prlimit(2) to set rlimits for the child (runc init) some time after we start it. This results in the following race:

runc exec (parent)                 runc init (child)
------------------                 -----------------
(see (*setnsProcess).start         (see init in src/syscall/rlimit.go)

                                   getrlimit(RLIMIT_NOFILE, &lim)
prlimit                            ....
                                   setrlimit(RLIMIT_NOFILE, &nlim)

So, I think we need to find a way to solve two problems:

  1. A get/set/set race with prlimit;
  2. How to know setted nofile limit by other process using prlimit.

So I think the question is whether there is a way that process B could detect that the process limit was changed by some other process via prlimit. In that case we should pass down the updated value, just as we already do if process B explicitly calls syscall.Setrlimit itself.

What's the progress now?
I think maybe there is no quick way to fix this issue? How about provide a public function in runtime to clear the cache? Let the user to decide whether the cache should be cleared or not.

We already have a public function to clear the cache:

    var r syscall.Rlimit
    syscall.Getrlimit(syscall.RLIMIT_NOFILE, &r)
    syscall.Setrlimit(syscall.RLIMIT_NOFILE, &r)

We already have a public function to clear the cache:

    var r syscall.Rlimit
    syscall.Getrlimit(syscall.RLIMIT_NOFILE, &r)
    syscall.Setrlimit(syscall.RLIMIT_NOFILE, &r)

We have considered this method, but it will cause new get/set race.

This only matters just before starting a child, so do it then. If that is a race, then you have a race problem no matter what.

AGWA commented

Maybe syscall.SysProcAttr should contain fields for setting the child's rlimits. runc could use that instead of the inherently racy practice of calling prlimit after starting the child.

Hi @lifubang, do you agree with what Ian wrote in #66797 (comment), or if not, could you briefly expand on why?

For the people reporting this here, is this something that affects Go code primarily consumed as a library, or package main? If there is not a valid workaround or solution in the near term, I wonder if a GODEBUG controlling the behavior could be temporary solution to help with backwards compatibility, or maybe that does not make sense?

if not, could you briefly expand on why?


Sorry for delay, because I'm in another busy thread.
I think it's easy to write a repro code, if someone has a intresting, it will be appreciated.

The explanation is like that:
If we use the way provided by Ian(and @ls-ggg) in process B which created by process A, this will cause a new race:

B:syscall.Getrlimit(syscall.RLIMIT_NOFILE, &r)
A:prlimit(B, a different nofile_rlimit)
B:syscall.Setrlimit(syscall.RLIMIT_NOFILE, &r)


Besides this, this way also causes us doing two unneeded syscall.

The core reason is that using Get/Set to clear the internal cache is not a atomic operation, I suggest to add a public method to clear this cache directly. For example: runtime.ClearSyscallRlimitCache().

I understand the race. My point is that this only matters when you are about to start a child process. So do it then. There is already a race when starting a child process: you can start the child before or after the other process calls prlimit. So the race that you mention doesn't make matters any worse.

Two additional system calls are trivial, it's not worth adding a very special purpose operation to avoid that.

We already have a public function to clear the cache:

    var r syscall.Rlimit
    syscall.Getrlimit(syscall.RLIMIT_NOFILE, &r)
    syscall.Setrlimit(syscall.RLIMIT_NOFILE, &r)

@ianlancetaylor the problem is, this relies on setrlimit(2) syscall returning no error, which is not always the case (for example, when we're in user namespace, we can't change rlimits for ourselves).

Perhaps something like this will fix the issue for us:

diff --git a/src/syscall/rlimit.go b/src/syscall/rlimit.go
index d77341bde9..8184f17ab6 100644
--- a/src/syscall/rlimit.go
+++ b/src/syscall/rlimit.go
@@ -39,11 +39,10 @@ func init() {
 }
 
 func Setrlimit(resource int, rlim *Rlimit) error {
-       err := setrlimit(resource, rlim)
-       if err == nil && resource == RLIMIT_NOFILE {
+       if resource == RLIMIT_NOFILE {
                // Store nil in origRlimitNofile to tell StartProcess
                // to not adjust the rlimit in the child process.
                origRlimitNofile.Store(nil)
        }
-       return err
+       return setrlimit(resource, rlim)
 }

I'm OK with that kind of change.

Change https://go.dev/cl/587918 mentions this issue: syscall: rm go:linkname from origRlimitNofile

Change https://go.dev/cl/588076 mentions this issue: syscall: Setrlimit: always clean rlimitNofileCache

Perhaps something like this will fix the issue for us:

I think this needs more discussion.
Setting nofile rlimit error means we don’t set it successfully, if we clear it at this time, it will cause the new child process’s nofile rlimit in a wrong value state.

If I’m saying wrong, please let me know.

I think this needs more discussion. Setting nofile rlimit error means we don’t set it successfully, if we clear it at this time, it will cause the new child process’s nofile rlimit in a wrong value state.

If I’m saying wrong, please let me know.

Perhaps something like this will fix the issue for us:

I think this needs more discussion. Setting nofile rlimit error means we don’t set it successfully, if we clear it at this time, it will cause the new child process’s nofile rlimit in a wrong value state.

If I’m saying wrong, please let me know.

Perhaps something like this will fix the issue for us:

I think this needs more discussion. Setting nofile rlimit error means we don’t set it successfully, if we clear it at this time, it will cause the new child process’s nofile rlimit in a wrong value state.

If I’m saying wrong, please let me know.

Whoever calls Setrlimit will return an error in this case. They will know they are not able to set the limits.

In our (runc) use case, we just need to remove the cache. We can set the limit as well, but that's optional, since we will use unix.Prlimit as we always did to set the limit.

To me, this is the best middle ground between Go and runc needs.

PS we can discuss it further in opencontainers/runc#4290 (i'm going to add go tip to its CI).

To me, this is the best middle ground between Go and runc needs.

Yes, I think this is enough for runc, but not enough for golang.

Whoever calls Setrlimit will return an error in this case.

Soft > Hard

I don't think it matters much. If Setrlimit fails for an ordinary call, it's likely to also fail when it is called while fork/execing a child.

I am reluctant to add new API for this very special case. It is very unusual for a program to care about the rlimit for child processes. It is very unusual for a program to change its rlimit and fail. The new API is only required for the intersection of those two very unusual cases.

One other issue is that Go runtime's caching of RLIMIT_NOFILE is based on the assumption that setrlimit is the only way to set the rlimits of a process. However, prlimit lets you set limits from a different process, and doing this doesn't update the cache because the Go runtime doesn't have any way of knowing that it happened (this is what we do in runc).

This causes several issues (one of which is the series of race conditions and logic issues we have now fixed in runc with stdlib workarounds by telling the Go runtime that we plan to change our RLIMIT_NOFILE with a dummy SetRlimit call) but the general takeaway I have is that this auto-adjustment logic isn't appropriate for general-purpose Go programs.

@cyphar I tend to take the opposite view, which is that the auto-adjustment logic works well for general-purpose programs, but not for special-purpose environments. Across all uses of Go, using prlimit to change the rlimit of a different process is a very rare operation.

That said, we are certainly open to a different approach if you have any suggestions.

Is Go the right language for something like runc, or should runc be deprecated in favor of crun? A container runtime must operate in extremely unusual environments and I’m not sure if it is reasonable to expect Go’s runtime to work there. C, C++, and Rust all seem like better choices here. It is true that C and C++ are not memory safe, but I expect most security problems in a container runtime to be logic errors interfacing with the Linux kernel, not memory safety problems. Given what a container runtime does, a logic error and memory corruption can easily lead to equally severe vulnerabilities.

Go is advertised as a systems language and in 2013-ish the decision was made to write Docker in Go, which lead to a lot of large projects using Go (such as most of the container ecosystem -- I would argue that Docker's usage of Go helped Go's popularity in the early days). Many of those projects use runc's code directly or use runc (because runc is by far the most commonly used container runtime). Deprecating it wouldn't make any sense.

The "maybe Go isn't the right language for this" ship has long since sailed. If we could travel back to 2013, I would argue that Go is the wrong language to use -- but at the time Go was the only language that made sense to use (Rust wasn't ready, and writing Docker in C/C++ would've lead to more security issues because Docker is a privileged daemon that has to deal with far more things than runc).

FWIW my experience is that Go isn't really a "systems language", it's more of a language for writing tools and web servers that don't require a lot of fine-grained control that system languages provide. But that isn't what it was advertised as in 2013, and even today Go is still sold as a "systems language" (it seems obvious now that that term means different things to different people).

When I submit bug reports as a runc maintainer, I'm more than aware that we are not seen as a welcome member of the Go community because we are always seen as "doing weird things" (such as being put on the "wall of shame" for trying to work around problems in the Go runtime that are not documented and took us ages to figure out how to fix). I try to avoid opening bug reports for these kinds of issues because I know that they will just be ignored. Being told "maybe you should just shut up shop and go somewhere else" is a little hurtful, to be honest.

It is true that C and C++ are not memory safe, but I expect most security problems in a container runtime to be logic errors interfacing with the Linux kernel, not memory safety problems.

Sure, but adding an additional basket of possible vulnerabilities isn't a good thing. We have to do a lot of parsing of data and other operations where memory safety helps us avoid dumb security issues. I suspect that most programs could have a large number of security issues from non-memory-safety bugs, but I'm sure we agree that memory safe languages should be used when possible? Fewer possible bugs is better, right?


For this particular issue, it seems very strange to argue that a process doing prlimit on your program is an "unusual" thing to happen for programs written in a "systems language". But again, I guess we have different definitions of what a systems language is.

I'm sorry that you don't feel like a welcome member of the community. Certainly my goal is to work with tools like runc to find solutions that will work for the whole community.

Using go:linkname to reach into the runtime is an example of something that simply doesn't work for the whole community. That doesn't mean that the problems can't be solved. It means that go:linkname is the wrong solution.

I stand by my claim that using prlimit to change the rlimit of a different process is a very rare operation. As I've said, I am certainly open to figuring out a way to make this work for you. After all, this issue is still open. But I don't think the answer is "the way that we do things is the only way that matters." There are other cases that matter, whether we consider Go to be a systems language or not.

For RLIMIT_NOFILE the Go runtime is trying to handle a real problem that real programs were running into in practice (#46279). The caching to reset the limit for child processes, also discussed on that issue, is also there to handle a real problem. And you also have a real problem. So let's figure out a way to solve all those problems, rather than saying that some of them don't matter.

AGWA commented

@cyphar If os/exec could configure the child's rlimits before execing it (via some new fields in syscall.SysProcAttr), would that eliminate runc's need to use prlimit?

Using go:linkname to reach into the runtime is an example of something that simply doesn't work for the whole community. That doesn't mean that the problems can't be solved. It means that go:linkname is the wrong solution.

We have opened RFCs to try to solve these problems without using go:linkname (such as #67639), it's just that the first version of the fix usually does require go:linkname because we sometimes run into cases where that really is the only way to fix an immediate problem without going through a long RFC process (we'd need #67639 to fix a CVE if we couldn't go:linkname).

For RLIMIT_NOFILE the Go runtime is trying to handle a real problem that real programs were running into in practice (#46279). The caching to reset the limit for child processes, also discussed on that issue, is also there to handle a real problem. And you also have a real problem. So let's figure out a way to solve all those problems, rather than saying that some of them don't matter.

I never said they don't matter, I was saying that the current way of doing it and the fixes that have been applied to it are flawed because they're designed around the idea that Go processes will always be the process that sets their own rlimits. I don't think this is true in general -- sysadmins sometimes need to change process rlimits for long-running daemons and might not expect their settings to not apply to children (unlike Unix programs written in other languages). (runc is not a long-running daemon, this is just something that seems like a problem to me in general.)

We can work around this in runc because while the same process doesn't set its own rlimit, both processes are runc and so we can co-ordinate things (which we've already done -- this issue is no longer a problem for us). My concerns are purely based on wanting to make sure that this issue doesn't bite other users where they can't work around the problem.

A "good enough" solution would be to save the rlimits set by Go during init and then check if the process rlimits are different when doing exec. If the value is different (or the Go process tried to do Setrlimit), the Go runtimne shouldn't touch them. There is theoretically a small TOCTOU race but you could argue that there would be a race anyway between a sysadmin doing prlimit and the fork+exec. This would solve the problem for every case except where a sysadmin sets the rlimit to be the same as the current rlimit -- unfortunately I don't see a way to work around this but at least it's a somewhat unlikely thing for a sysadmin to do in practice.

If os/exec could configure the child's rlimits before execing it (via some new fields in syscall.SysProcAttr), would that eliminate runc's need to use prlimit?

Maybe, but we have already fixed the issue for runc (by doing a dummy Setrlimit which works now that Go will clear the cache regardless of whether it fails) and so we don't have a strong need for such a feature. If it existed, we might switch to it but I'm not sure (if a user sets really restrictive limits, we probably don't want to apply them to the entirety of runc init because it could lead to errors caused by runc not the container workload -- this is something we have to consider for cgroups, but this might also just be a theoretical concern).

However, that wouldn't help with the case where a sysadmin does a prlimit, which is the case I'm worried about (not from runc's standpoint, but as a Go programmer that has written long-running daemons that could be adjusted by a sysadmin this way).

I never said they don't matter, I was saying that the current way of doing it and the fixes that have been applied to it are flawed because they're designed around the idea that Go processes will always be the process that sets their own rlimits. I don't think this is true in general -- sysadmins sometimes need to change process rlimits for long-running daemons. (runc is not a long-running daemon, this is just something that seems like a problem to me in general.)

Quite agree.


I'm sorry that you don't feel like a welcome member of the community. Certainly my goal is to work with tools like runc to find solutions that will work for the whole community.

I think cyphar's original mean is that Runc maintainers always submit issues which are difficult to resolve in go community, and then we had to decide to using some weird ways to fix them in my own, for example go:linkname. I think this is a figure of speech, not a real feeling.
We know that go:linkname is not recommended, but we just only take it for a short term solution to give golang community more time to fix.

Yes, we didn't want to submit these issues for no reasons, because that are really problems met by runc. The goal of submitted them to golang community is to fetch some possible helps from others. I think golang maintainers gave us lots of solutions, though some of solutions are not perfect. We should say thanks to these nice persons.

We understand that these issues filed by us are not critical for golang, golang maintainers have lots of more important things to do. So we should have a thorough discussion instead of rushing to fix it with imperfect methods, though these are not cared by other golang users.


So let's figure out a way to solve all those problems

For this specific issue, I think the original implementation(#46279) is not recommended. This is the core reason of this discuss. I think that if we want to raise the soft limit of nofile, we should let the users to do this, we should not do it silently in runtime. It's the responsibility of the user, but not the runtime. I think a language runtime should provide more and more controllable functions for the users, but not make a decision for them. Once we have a wrong beginning, we need to add more imperfect implementations to fix some unimportant issues, it's not worth to do.

My Suggest way: I think we should raise this soft limit in their own tools when they need to do it, for example: cmd/gofmt, cmd/go, but not in go runtime.

Thanks.



@lifubang Thanks, but we came to a decision on #46279 and we would need a very good reason to reverse that decision. The current situation seems better for the vast majority of Go programs even if it is worse for a few.

@cyphar Thanks for the suggestion. What do you think of https://go.dev/cl/607516?

Change https://go.dev/cl/607516 mentions this issue: syscall: honor prlimit set by a different process