golang/go

runtime/metrics: add /gc/heap/live:bytes

felixge opened this issue ยท 22 comments

There appears to be no way to monitor what is called the Live Heap in the GC Guide.

Screen Shot 2022-11-20 at 04 25 30

The closest runtime/metric seems to be /memory/classes/heap/objects:bytes which is Live Heap + New Heap (i.e. includes potentially dead but unswept allocations).

Monitoring Live Heap seems to be useful when running with GOGC=off and a GOMEMLIMIT to understand if the live heap of a program is approaching the GOMEMLIMIT which could lead to high CPU usage or OOM kills.

Therefor I'm proposing to add /memory/classes/heap/live_objects:bytes and /gc/heap/live_objects:objects to the runtime/metrics package. They should return similar values as what is currently returned by the InUse* values of the runtime.MemProfile() profile. It's probably worth to discuss whether the term InUse is preferable over Live.

I created a very hacky prototype of this here, but there is probably a much better way to do this.

A similar proposal was rejected previously (#51966). Part of the problem was inclusion in MemStats, but runtime/metrics also existed and I made some other arguments against it. Mainly, I think relying on it solely as an indicator can be somewhat fragile if you put it in context of the Go 1.18 GOGC changes. I wasn't ever fundamentally opposed to this, but at the time for the use-case presented, it just didn't seem worth it. I also wanted to think more carefully about what implications it could have (and why we didn't do it already).

However, I do think your argument that it's useful for identifying issues with a memory limit is "new information" (now that the memory limit has landed) as far as the proposal processes goes. I think that might be enough to push this over the edge into more serious consideration for me personally.

I think to start with I'd be inclined at this point to call this /gc/heap/live:bytes. This way it lives next to /gc/heap/goal:bytes. /memory/classes to me is more about a breakdown of memory at a higher level and all the metrics inside there are supposed to be orthogonal. We'd have to break up some of the other metrics and I don't think it's worth it for this.

If we do this, we should probably also export scannable stack bytes and scannable globals bytes, at least as far as they go into the GOGC calculation. We should also really export GOGC and the memory limit as well, since they can change during application execution.

I don't really see enough utility from exporting the count of live objects though. I think a memory profile is better for that sort of thing anyway. (Also, we don't currently track that (mod memory profiling) and it's more work.)

I created a very hacky prototype of this here, but there is probably a much better way to do this.

Yeah, there's a simpler way for bytes. :) You want gcController.heapMarked. You can just read that directly in the metrics callback without any synchronization since it only gets updated during a STW.

Sounds great. I'll update my patch based on your suggestions.

I also wanted to think more carefully about what implications it could have (and why we didn't do it already).

Yeah, that makes sense. Seems like this is missing the code freeze anyway.

I updated the CL to use gcController.heapMarked, but found one odd thing.

If we do this, we should probably also export scannable stack bytes and scannable globals bytes, at least as far as they go into the GOGC calculation. We should also really export GOGC and the memory limit as well, since they can change during application execution.

I can prepare CLs for those as well. Should I submit them separately?

I updated the CL to use gcController.heapMarked, but found one odd thing.

It's an accounting artifact currently required for good performance. It would be nice to fix, but I'm not yet sure how without either (1) impacting the allocation fast path or (2) slowing down metrics.Read to ReadMemStats levels (this path would avoid the global latency impact that ReadMemStats has because it's not necessary to actually, fully stop the world, but the worst-case latency of metrics.Read becomes effectively the same as stopping the world).

For users that really need this consistency (like the testing package), ReadMemStats provides it, so there is a workaround. Most of the time though, the skew doesn't make a very meaningful difference, and the memory stats (in /memory/classes) are all at least guaranteed to be consistent with each other. It might be worth mentioning in the docs that this stat specifically isn't necessarily consistent with the /memory/classes stats. We can lift this restriction in the future if we can figure out a way to flush cached efficiently (I'll keep thinking about it).

If we do this, we should probably also export scannable stack bytes and scannable globals bytes, at least as far as they go into the GOGC calculation. We should also really export GOGC and the memory limit as well, since they can change during application execution.

I can prepare CLs for those as well. Should I submit them separately?

Either or. Your current CL is not very big.

rsc commented

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
โ€” rsc for the proposal review group

rsc commented

Based on the discussion above, this proposal seems like a likely accept.
โ€” rsc for the proposal review group

rsc commented

No change in consensus, so accepted. ๐ŸŽ‰
This issue now tracks the work of implementing the proposal.
โ€” rsc for the proposal review group

@felixge were you planning to implement this?

@felixge were you planning to implement this?

Yes, I'll try to submit a good patch for /gc/heap/live:byte ๐Ÿ‘.

If we do this, we should probably also export scannable stack bytes and scannable globals bytes, at least as far as they go into the GOGC calculation. We should also really export GOGC and the memory limit as well, since they can change during application execution.

If time allows I could also try to look into these, but I'm not sure if they'd be covered by the accepted proposal or would need separate proposals?

That's a good question. I personally feel we should just include those in the proposal.

Change https://go.dev/cl/497315 mentions this issue: runtime/metrics: add /gc/heap/live:bytes

Change https://go.dev/cl/497317 mentions this issue: runtime/metrics: add /gc/heap/goal:percent

Change https://go.dev/cl/497316 mentions this issue: runtime/metrics: add /gc/heap/limit:bytes

Change https://go.dev/cl/497320 mentions this issue: runtime/metrics: add /gc/global/scannable:bytes

Change https://go.dev/cl/497319 mentions this issue: runtime/metrics: add /gc/stack/scannable:bytes

Change https://go.dev/cl/497575 mentions this issue: runtime/metrics: add /gc/scan/heap:bytes

Change https://go.dev/cl/497576 mentions this issue: runtime/metrics: add /gc/scan/total:bytes

All of the metrics discussed in this proposal have landed, see patches above. Closing this ๐Ÿฅณ

Thanks @mknyszek et al. for reviews and discussion ๐Ÿ™‡

Change https://go.dev/cl/497880 mentions this issue: runtime: ensure consistency of /gc/scan/*

Change https://go.dev/cl/498075 mentions this issue: doc: add release note for runtime/metrics additions