kubernetes-sigs/gcp-compute-persistent-disk-csi-driver

Memory usage increasing with # cpus, especially with go1.17

mattcary opened this issue

Some aspect of the csi driver uses memory proportional to the number of CPUs on its VM. This appears to be worse with go1.17 (which we recently upgraded to).

This was reported by a customer on GKE who saw OOMs at the 50M container limit we use on GKE clusters, on n2-standard-32 machines. This was associated with a workload that had tens of volumes attached to the node (so that there was steady stage & mount activity on the node as workloads were scheduled). Testing similar workloads on 2- and 4-CPU machines, and on e2 machines, did not replicate the memory usage (it stayed around 18M), but the test workloads consumed over 40M on n2-standard-32 machines.

In order to look into this I added Go profiling to the driver and ran it in the e2e harness; see the following gist for details: https://gist.github.com/mattcary/6f59a77154c2e9ee24eb3c90b78952b6
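For reference, the hook is roughly of this shape. This is a minimal sketch of exposing heap profiles via net/http/pprof; the port and wiring are illustrative, and the actual change is in the gist:

```go
// Minimal sketch of exposing Go's pprof endpoints from a long-running binary
// such as the driver. Port 6060 is an arbitrary choice for illustration.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	go func() {
		// Heap profiles can then be pulled with:
		//   go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... normal driver startup would continue here ...
	select {}
}
```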

(For those not familiar: the e2e test runs the driver by itself on a bare GCE instance, interacting with it directly over gRPC.) The full e2e test provisions instances and runs remotely; my new memory test is meant to be run directly on a manually provisioned e2e instance. I ran the driver in a systemd-run container with a 50M memory limit.

Unfortunately, neither golang's alloc_space nor inuse_space matches up well with other memory statistics, for example the k8s container memory metrics. I'm hopeful that they all correlate, though. I focused on inuse_space (since alloc_space seems to be very high in all cases). This reproduced the CPU-dependent memory usage, at least in relative terms, though not in absolute ones.

The e2e memory test runs iterations of creating a volume, attaching it, doing some volume lists, then staging, mounting, unmounting, detaching and deleting the volume. I recorded memory usage by running go tool pprof -inuse_space and looking at total usage. Comparing runs of several hundred iterations produced the memory usage shown in the first graph included in the gist. While the data is noisy, the run on the n2/32 machine shows higher sustained memory usage.
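For illustration, here is a heavily abbreviated sketch of one iteration of that RPC sequence, written against the standard CSI gRPC interfaces. The socket path, volume name, node ID and mount paths are placeholders, and the capability, capacity and parameter fields a real driver requires are elided; the actual test is in the gist:

```go
// Sketch of one iteration: create, attach, list, stage, mount, then tear down.
package main

import (
	"context"
	"log"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func runIteration(ctx context.Context, conn *grpc.ClientConn, nodeID string) error {
	ctrl := csi.NewControllerClient(conn)
	node := csi.NewNodeClient(conn)

	// Create and attach a volume.
	cv, err := ctrl.CreateVolume(ctx, &csi.CreateVolumeRequest{Name: "memtest-vol"})
	if err != nil {
		return err
	}
	volID := cv.GetVolume().GetVolumeId()
	if _, err := ctrl.ControllerPublishVolume(ctx, &csi.ControllerPublishVolumeRequest{VolumeId: volID, NodeId: nodeID}); err != nil {
		return err
	}

	// A few volume lists, as in the test.
	for i := 0; i < 3; i++ {
		if _, err := ctrl.ListVolumes(ctx, &csi.ListVolumesRequest{}); err != nil {
			return err
		}
	}

	// Stage and mount, then tear everything back down.
	staging, target := "/tmp/staging", "/tmp/target"
	if _, err := node.NodeStageVolume(ctx, &csi.NodeStageVolumeRequest{VolumeId: volID, StagingTargetPath: staging}); err != nil {
		return err
	}
	if _, err := node.NodePublishVolume(ctx, &csi.NodePublishVolumeRequest{VolumeId: volID, StagingTargetPath: staging, TargetPath: target}); err != nil {
		return err
	}
	if _, err := node.NodeUnpublishVolume(ctx, &csi.NodeUnpublishVolumeRequest{VolumeId: volID, TargetPath: target}); err != nil {
		return err
	}
	if _, err := node.NodeUnstageVolume(ctx, &csi.NodeUnstageVolumeRequest{VolumeId: volID, StagingTargetPath: staging}); err != nil {
		return err
	}
	if _, err := ctrl.ControllerUnpublishVolume(ctx, &csi.ControllerUnpublishVolumeRequest{VolumeId: volID, NodeId: nodeID}); err != nil {
		return err
	}
	_, err = ctrl.DeleteVolume(ctx, &csi.DeleteVolumeRequest{VolumeId: volID})
	return err
}

func main() {
	conn, err := grpc.Dial("unix:///tmp/csi.sock", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	if err := runIteration(context.Background(), conn, "my-node"); err != nil {
		log.Fatal(err)
	}
}
```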

Looking at where the extra memory usage came from was not conclusive. Most memory usage in all runs is in buffer manipulation, server setup, or goroutine management. I only found consistent differences in looking at the totals.

One can control golang's usage of processors with GOMAXPROCS. With the default of 0 (i.e. unset), the runtime sizes its threads and scheduler resources based on the number of CPUs present on the machine; it can be overridden with a specific value. I added a flag to the driver to set GOMAXPROCS at startup (see the gist, and the sketch below) and then ran with various values. In addition, since we had recently (beginning of 2022) updated the driver build process to go v1.17, I compared usage of the driver between v1.16 and v1.17.
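The flag is roughly of this shape (the flag name and wiring here are illustrative; the real change is in the gist):

```go
// Sketch of a startup flag for overriding GOMAXPROCS.
package main

import (
	"flag"
	"runtime"
)

var maxProcs = flag.Int("maxprocs", 0, "Value for runtime.GOMAXPROCS; 0 keeps the Go default (the machine's CPU count)")

func main() {
	flag.Parse()
	if *maxProcs > 0 {
		runtime.GOMAXPROCS(*maxProcs)
	}
	// ... rest of driver startup ...
}
```

Leaving the flag at 0 keeps the runtime's normal behavior of matching the machine's CPU count, which is how the MAXPROCS=0 runs below were produced.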

See the second graph included in the gist. Each plot in the graph is a set of 5 runs on an n2-standard-32 machine. As cryptically expressed in the plot legend, the first run used MAXPROCS set to 0, the second to 1, and the subsequent runs to 2, 16 and 32. The runs were done sequentially and are not explicitly marked on the graph, but the difference in memory usage is clear. Interestingly, memory usage with MAXPROCS=0 is closest to that of 2, not of the 32 CPUs present on the machine. The value of 1 is a clear minimum.

Interestingly, the usage with go 1.16 is much less than with 1.17, and the difference between MAXPROCS=1 and the other settings is more pronounced (also, MAXPROCS=16 seems the same as 1, which is odd).

Our recommendation is to set MAXPROCS=1 in order to minimize memory usage. The driver is not CPU-bound, so we do not anticipate this having any effect on performance.

Caveats

  • This does not exactly replicate what happens on a node, which would only do the staging and mounting operations. But since I suspect that most memory usage is in gRPC and proto manipulation, the exact calls probably are not important.

  • I never saw OOMs when running in the 50M systemd container. Note that the limit is important for accurately looking at memory usage, as it will affect garbage collection and memory page releasing behavior (see for instance the golang 1.12 madvise free kerfuffle).

  • In the n2 vs e2 plot, the e2 device actually has the higher spike, which would be more relevant for OOMs if that were actual memory usage from the kernel's perspective. But since this clearly contradicts the observations from the tests on GKE, I think it has to be noise.

  • There are probably changes in the profiler between 1.16 and 1.17, so inuse_space may not be comparable across versions.

  • If the 1.16 vs 1.17 inuse_space measurements are absolutely comparable, then go 1.17 memory usage is simply higher than that of 1.16. If they are only relatively comparable, then the effect of MAXPROCS is more pronounced in 1.17 than in 1.16. In either case, setting MAXPROCS=1 seems like a good solution, either as an obvious and easy memory optimization or as a fix for a regression.