golang/go

runtime/trace: flight recording


Proposal: runtime/trace flight recording

Updated: 23 September 2023

Background

"Flight recording" is a technique in which trace data is kept in a conceptual circular buffer, flushed upon request. The purpose of this technique is to capture traces of interesting program behavior, even when one does not know ahead of time when that will happen. For instance, if the web service fails a health check, or the web service takes an abnormally long time to handle a request. Specifically, the web service can identify such conditions when they happen, but the programmer setting up the environment can't predict when exactly they will occur. Starting tracing after something interesting happens also tends not to be useful, because the program has already executed the interesting part.

The Java ecosystem has had this for years through Java's flight recorder. Once the JVM's flight recorder is enabled, the JVM can obtain a trace representing the last few seconds of time. This trace can come from triggers set up in JMX, or by passing a flag to the JVM that dumps a trace on exit.

With the implementation of #60773 now approaching a stable point, hopefully in Go 1.22 all traces will become a series of self-contained partitions. This implementation change presents an opportunity to easily add something similar to the Go execution tracer by always retaining at least one partition that can be snapshotted at any time.

This is also enabled by work in the Go 1.21 release to make traces dramatically cheaper. Because flight recording relies on waiting until something interesting happens, tracing needs to be enabled for a much longer period of time. Enabling flight recording across, for example, a small portion of a production fleet, becomes much more palatable when the tracing itself isn't too expensive.

Design

The core of the design is a new API in the runtime/trace package to enable flight recording. This means that programs can be instrumented with their own triggers.

package trace

// FlightRecorder represents a flight recording configuration.
//
// Flight recording can be thought of as a moving window over
// the trace data, with the window always containing the most
// recent trace data.
//
// Only one flight recording may be active at any given time.
// This restriction may be removed in the future.
type FlightRecorder struct {
    ...
}

// NewFlightRecorder creates a new flight recording configuration.
func NewFlightRecorder() *FlightRecorder

// SetMinAge sets a lower bound on the age of an event in the flight recorder's window.
//
// The flight recorder will strive to promptly discard events older than the minimum age,
// but older events may appear in the window snapshot. The age setting will always be
// overridden by SetMaxBytes.
//
// The initial minimum age is implementation defined, but can be assumed to be on the order
// of seconds.
//
// Adjustments to this value will not apply to an active flight recorder.
func (*FlightRecorder) SetMinAge(d time.Duration)

// MinAge returns the current MinAge setting.
func (*FlightRecorder) MinAge() time.Duration

// SetMaxBytes sets an upper bound on the size of the window in bytes.
//
// This setting takes precedence over SetMinAge.
// However, it does not make any guarantees on the size of the data WriteTo will write,
// nor does it guarantee memory overheads will always stay below MaxBytes. Treat it
// as a hint.
//
// The initial size is implementation defined.
//
// Adjustments to this value will not apply to an active flight recorder.
func (*FlightRecorder) SetMaxBytes(bytes uint64)

// MaxBytes returns the current MaxBytes setting.
func (*FlightRecorder) MaxBytes() uint64

// Start begins flight recording. Only one flight recorder and one call to trace.Start may be active
// at any given time. Returns an error if starting the flight recorder would violate this rule.
func (*FlightRecorder) Start() error

// Stop ends flight recording. It waits until any concurrent WriteTo calls exit.
// Returns an error if the flight recorder is inactive.
func (*FlightRecorder) Stop() error

// Enabled returns true if the flight recorder is active. Specifically, it will return true if Start did
// not return an error, and Stop has not yet been called.
// It is safe to call from multiple goroutines simultaneously.
func (*FlightRecorder) Enabled() bool

// WriteTo takes a snapshot of the circular buffer's contents and writes the execution data to w.
// Returns the number of bytes written and an error. Only one goroutine may execute WriteTo at a time. 
// An error is returned upon failure to write to w, if another WriteTo call is already in-progress,
// or if the flight recorder is inactive.
func (*FlightRecorder) WriteTo(w io.Writer) (n int64, err error)
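
For illustration, here is a minimal usage sketch of the API above, assuming it lands in runtime/trace as proposed. The handler, latency threshold, and output file are invented for the example; the point is simply that program logic decides when to snapshot the window.

package main

import (
    "log"
    "net/http"
    "os"
    "runtime/trace"
    "time"
)

func main() {
    fr := trace.NewFlightRecorder()
    fr.SetMinAge(10 * time.Second) // keep roughly the last 10 seconds of activity
    fr.SetMaxBytes(16 << 20)       // but treat ~16 MiB as a soft cap on the window
    if err := fr.Start(); err != nil {
        log.Fatal(err)
    }
    defer fr.Stop()

    http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        handleWork(w, r) // hypothetical request handler
        if time.Since(start) > 500*time.Millisecond {
            // Something interesting just happened: snapshot the window.
            f, err := os.Create("slow-request.trace")
            if err != nil {
                log.Print(err)
                return
            }
            defer f.Close()
            if _, err := fr.WriteTo(f); err != nil {
                log.Print(err)
            }
        }
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}

func handleWork(w http.ResponseWriter, r *http.Request) {
    // Placeholder for real work.
}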

Implementation

Most of the implementation work has already been done. I think it would be OK to ship the implementation in golang.org/x/exp/trace, even though it has some inefficiencies (such as having to copy buffers outside the runtime). We could make this more efficient by reference-counting the runtime buffers to avoid a copy.

The main thing that definitely needs to change, however, is that the flight recorder needs to be able to run simultaneously with a call to trace.Start, which is currently not possible, since golang.org/x/exp/trace uses trace.Start itself. This can be implemented by having the trace reader goroutine (misnomer) write buffers to multiple writers. All we need to do is call traceAdvance simultaneously with adding a new writer, so that the new writer always starts receiving data on a generation boundary.
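
As a toy sketch of that fan-out idea (not the runtime's actual code; the names are invented), the reader goroutine can hold a registration list and copy each flushed batch of trace data to every registered writer:

package tracedemo

import (
    "io"
    "sync"
)

// traceSink is a stand-in for the output side of the trace reader goroutine.
type traceSink struct {
    mu      sync.Mutex
    writers []io.Writer
}

// add registers a new consumer. In the runtime, this is where traceAdvance
// would be called so the new consumer starts receiving data on a generation
// boundary.
func (s *traceSink) add(w io.Writer) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.writers = append(s.writers, w)
}

// flush is called with each batch of trace data read from the runtime's buffers.
func (s *traceSink) flush(batch []byte) {
    s.mu.Lock()
    defer s.mu.Unlock()
    for _, w := range s.writers {
        w.Write(batch) // error handling elided in this sketch
    }
}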

Discussion

SetMinAge and SetMaxBytes could give more rigid guarantees, but it's both complex to implement and not terribly useful. The primary use-case for SetMinAge is to allow users to ask for longer traces (for example, if a web service's "long request" means something much longer than a handful of seconds). Meanwhile the primary use-case of SetMaxBytes is to control memory overheads and limit the damage caused by a large SetMinAge.

WriteTo could allow multiple goroutines to call it, since it could easily serialize them internally. However, this can create some bad situations. For instance, consider some snapshot trigger condition that causes multiple goroutines to call WriteTo. The call is heavyweight and they'll queue up behind each other; the longest one will likely take quite a while to resolve, and the application will be significantly disrupted. It'll also produce traces that aren't very useful (consisting of short partitions corresponding approximately to the duration of the last WriteTo call) unless we also allow multiple goroutines to read the same partition's buffers. However, that's going to be fairly complicated to implement, and it doesn't add much value either, since it's just duplicate data. The current design side-steps these issues, and reduces the risk of run-time panics, by returning an error in this case.

Alternatives considered

External circular buffer

@dominikh has suggested adding a similar feature to gotraceui. Because the partitioning is actually baked into the trace's structure, it's possible for trace consumers to implement something similar themselves. The only thing missing is a standard streaming endpoint for execution traces (see follow-up work).

However, there are a few advantages to adding support directly to the standard library and runtime.

  • It opens the door to taking trace snapshots before a program crashes, which is going to be tricky to make work in general from outside the runtime.
  • It's more efficient in that no trace data needs to be processed (or sent to another process, or over the network) until it's actually needed.
  • Any external circular buffer has to make decisions solely on trace content. There's a minor ease-of-use improvement for those doing ad-hoc debugging, since they don't have to model their conditions in the runtime/trace package's annotations, but can instead decide when to grab a snapshot directly.
  • The possibility for more control over trace partitioning (via SetMinAge and SetMaxBytes). Any external solution will be at the mercy of the runtime's defaults.

Despite these advantages, it's likely worth pursuing support for such a use-case even if the API described in this proposal is made available. A shortcoming of this document's proposal is that there's no way to trigger a snapshot explicitly against trace data, only program logic. Handling traces externally also means the ability to perform ad-hoc analyses without the need for additional instrumentation.

Follow-up work

Add support for trace streaming endpoints to net/http/pprof

As stated in the discussion of the "external circular buffer" alternative, we could support that alternative easily and well by just adding a standard debugging endpoint for streaming trace data. It probably makes the most sense to just add new query parameters to the existing trace endpoint; the details of that can be hashed out elsewhere.

I agree with all of the advantages of having flight recording in the runtime.

As for

A shortcoming of this document's proposal is that there's no way to trigger a snapshot explicitly against trace data, only program logic

I think that program logic will be the far more useful trigger for the majority of use cases, and certainly easier to make use of.
Program logic can also make use of all the exposed runtime metrics, bringing it closer to the power of triggering on trace data.

On top of that, this flight recorder will be much more straightforward to integrate with existing telemetry solutions, compared to external solutions, which will either not be designed for that (Gotraceui) or will have to be written first.

It opens the door to taking trace snapshots before a program crashes

Just to say the most frequent "I wish I knew what happened just before" for me is OOM, which isn't (AFAIK) something you can trigger on once the kernel makes its decision. Maybe we could trigger on hitting GOMEMLIMIT?

rhysh commented

Nice!

How can we get CPU profile samples to show up in the flight-recorder data stream? The existing APIs for execution traces and CPU profiles have a start time and an end time, so those two pair well together. For this it seems that we'd need a way to ask for SIGPROF deliveries during the whole time the flight recorder is running, without preventing normal on-demand use of runtime.StartCPUProfile.

The panics in WriteTo look like they'd force those to coordinate with calls to Stop (to avoid the "flight recorder isn't running" panic). Could they instead return errors if there's another WriteTo call, and end early (with an error) if there's a concurrent Stop call? Panic risk like this makes me nervous; I think instrumentation in general should Do No Harm.

consisting of short partitions corresponding approximately to the duration of the last WriteTo call

You've said that the data from WriteTo calls won't overlap. But on the other side, do the partitions in the new data format include identifiers (partition sequence number $X from thread/M $Y, and a list of all Ms?) that allows stitching together traces that are known to have no gaps from a set of shorter WriteTo calls?

Just to say the most frequent "I wish I knew what happened just before" for me is OOM, which isn't (AFAIK) something you can trigger on once the kernel makes its decision. Maybe we could trigger on hitting GOMEMLIMIT?

It turns out that's quite difficult to do because Linux provides no opportunity to dump any information when an OOM occurs; the OOM killer simply SIGKILLs processes. Unfortunately GOMEMLIMIT also doesn't help, because GOGC=off GOMEMLIMIT=<value> is a valid configuration. In that configuration, the program is always at GOMEMLIMIT from the perspective of the operating system.

With that being said, there are still some best-effort things we might be able to do. Programs that don't use GOMEMLIMIT can look at their own runtime/metrics stats and dump a trace when their total memory use (the same expression that the GOMEMLIMIT machinery uses to account for total memory) exceeds some threshold. For programs that do use GOMEMLIMIT, it's possible to use the new live heap metric to do something similar. That metric doesn't account for newly allocated data, so it gives an indication of how close the program's live memory is to the limit.
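
As a sketch of that best-effort approach (assuming the proposed FlightRecorder API; the metric choice, threshold, and file name are illustrative), a goroutine can poll runtime/metrics and snapshot the flight recorder once total runtime-managed memory crosses a threshold:

package memwatch

import (
    "log"
    "os"
    "runtime/metrics"
    "runtime/trace"
    "time"
)

// watchMemory snapshots fr once total runtime-mapped memory exceeds threshold.
// It is best-effort: the program may still OOM before the write completes.
func watchMemory(fr *trace.FlightRecorder, threshold uint64) {
    samples := []metrics.Sample{{Name: "/memory/classes/total:bytes"}}
    for range time.Tick(time.Second) {
        metrics.Read(samples)
        if samples[0].Value.Uint64() < threshold {
            continue
        }
        f, err := os.Create("near-oom.trace")
        if err != nil {
            log.Print(err)
            return
        }
        if _, err := fr.WriteTo(f); err != nil {
            log.Print(err)
        }
        f.Close()
        return
    }
}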

Either way though, it's still best-effort. Unless the snapshotting fully stops the world, the program may continue executing and OOM before the trace actually gets fully dumped. (Even then, it's still possible (but less likely) that it OOMs before the runtime successfully stops every running goroutine.)

Once upon a time there was a patch proposed to Linux to allow for halting a process when it hit container limits, so that another process in a tiny memory reservation could inspect it. One could imagine that if this other process created a core dump of the halted process, we could write some code to extract any active trace buffers from the core dump into a trace.

I'm not sure where this leaves us. Perhaps it suggests that WriteTo should really stop-the-world, and do all the copying during the stop-the-world, so that these difficult cases are more deterministic. But that hurts other cases like "I just want to know why my request is taking a long time sometimes," since it has a bigger run-time performance impact, especially if your intent is to capture multiple such cases (i.e. leave it running in production for a while) and aggregate them. And in the end, it's also just a patch on the broader problem which is that it's surprisingly hard to just get information about an OOM on Linux to begin with.

(Sorry, mis-click.)

Nice!

How can we get CPU profile samples to show up in the flight-recorder data stream? The existing APIs for execution traces and CPU profiles have a start time and an end time, so those two pair well together. For this it seems that we'd need a way to ask for SIGPROF deliveries during the whole time the flight recorder is running, without preventing normal on-demand use of runtime.StartCPUProfile.

If you happen to have CPU profiling running, it'll just work, but you make a good point that there's no good way to have it included all the time (i.e. no intention of producing a CPU profile). It seems to me like that should maybe be another option, either on the FlightRecorder, or on some new CPU profile configuration value. I'm not sure where it would be better to add this so that it composes well. My gut tells me it should go on the CPU profile API. (One could imagine that CPU profiling could have a similar mode, where the profiling data is just kept in the buffer and snapshotted every once in a while. That would mirror flight recording, just like trace.Start mirrors StartCPUProfile (as you point out).)

The panics in WriteTo look like they'd force those to coordinate with calls to Stop (to avoid the "flight recorder isn't running" panic). Could they instead return errors if there's another WriteTo call, and end early (with an error) if there's a concurrent Stop call? Panic risk like this makes me nervous; I think instrumentation in general should Do No Harm.

Hm, that's a good point. I'll update the API to return an error on Stop and update the reasons WriteTo could fail.

consisting of short partitions corresponding approximately to the duration of the last WriteTo call

You've said that the data from WriteTo calls won't overlap. But on the other side, do the partitions in the new data format include identifiers (partition sequence number $X from thread/M $Y, and a list of all Ms?) that allows stitching together traces that are known to have no gaps from a set of shorter WriteTo calls?

There's currently no list of all the Ms per generation because we don't have M events, but yes, everything within a partition is namespaced by that partition's "generation number." The proposed trace parsing API exposes partition changes as a Sync event, so it'll be possible to identify when it happens. The number won't be exposed, but users of the API can assign their own identifiers to each partition. (If you're parsing the trace directly, then yeah the generation number will be available.)

FWIW, the trace parsing API already does this exact kind of "stitching." Every partition is an entirely self-contained trace, which means all goroutines (and Ps) and their statuses get named in every partition. The trace parser uses this information to validate the stitching: new partitions' goroutine statuses need to match where that goroutine left off in the previous partition.
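
For example, a consumer of the parsing API could count partition boundaries roughly like this (a sketch against golang.org/x/exp/trace; exact names may shift as the API settles):

package main

import (
    "bufio"
    "fmt"
    "io"
    "log"
    "os"

    "golang.org/x/exp/trace"
)

func main() {
    f, err := os.Open("flight.trace")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    r, err := trace.NewReader(bufio.NewReader(f))
    if err != nil {
        log.Fatal(err)
    }
    partitions := 0
    for {
        ev, err := r.ReadEvent()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        if ev.Kind() == trace.EventSync {
            // A new self-contained partition (generation) starts here.
            partitions++
        }
    }
    fmt.Println("partitions:", partitions)
}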

rsc commented

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

rsc commented

I think this is waiting on an implementation and experience from using that implementation.

That's correct. I plan to have one in golang.org/x/exp/trace soon.

I made a mistake in the commit message so gopherbot couldn't connect the CL to this issue, but the experimental API has now landed via https://go.dev/cl/550257. It is available in golang.org/x/exp/trace for Go 1.22 only.

There are a few caveats with this implementation that will not be true with a real runtime implementation.

  • Higher CPU overhead: because this implementation lives outside the runtime, it needs to keep a copy of the latest trace data. Copying that data should be less expensive than writing out trace data via HTTP, so I still generally expect CPU overheads to be about the same as just having tracing enabled, which is good. It can get slightly better than that with a runtime implementation.
  • Higher memory overhead: again, because we need to keep a copy of the latest trace data, expect the memory overheads to be higher than what the actual implementation can achieve. It's probably not a full 2x because the runtime will have to keep trace data around longer itself, but the golang.org/x/exp/trace implementation has to wait a little longer to throw out old trace data. I wouldn't expect this to be more than a few MiB for most applications, though.
  • Higher snapshot latency: the latency of the actual snapshot operation (WriteTo) is higher because the implementation actually needs to make 2 consecutive global buffer flushes (the results of both appear in the output, so in practice it shouldn't make a difference for analysis). This is because the end of a trace generation isn't explicitly signaled but derived from new trace data. This was perhaps a mistake in the format that can be fixed in a backwards-compatible way, but the runtime implementation has complete knowledge of when a generation starts and ends, so this won't be a problem in the future.

I don't think any of these are significant enough to detract from the usefulness of the experiment, but I wanted to bring it up in case one of these does become an issue. We can also explore ways to improve the experiment to make it more representative, if one of them is indeed a problem.

Please give it a try!
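
For anyone wiring the experiment into a service, an on-demand snapshot endpoint might look like the following sketch. Only NewFlightRecorder, Start, and WriteTo from the experimental package are assumed; the route and error handling are simplified.

package main

import (
    "log"
    "net/http"

    "golang.org/x/exp/trace"
)

func main() {
    fr := trace.NewFlightRecorder()
    if err := fr.Start(); err != nil {
        log.Fatal(err)
    }

    // Hitting this endpoint dumps the current window as a binary trace.
    http.HandleFunc("/debug/flightrecord", func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "application/octet-stream")
        if _, err := fr.WriteTo(w); err != nil {
            log.Printf("flight record dump failed: %v", err)
        }
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}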

The x/exp/trace flight recorder seems to work nicely for me so far. Thanks for adding it!

One limitation I encountered is that the runtime/trace package does not allow multiple trace.Start calls at the same time, so I needed to disable capturing runtime traces in the DataDog library with DD_PROFILING_EXECUTION_TRACE_ENABLED=false env variable. I wonder if we could enable multiple consumers of the trace data at the same time. cc @felixge

As for preserving traces after OOM, one possible approach could be to use a separate agent process with a shared memory buffer to write the traces into. That agent could then detect that the main process crashed and write the contents of the memory buffer into a file or other storage. However, this would require keeping all the flight recorder data in a single pre-allocated buffer.

Thanks for trying it out!

One limitation I encountered is that the runtime/trace package does not allow multiple trace.Start calls at the same time, so I needed to disable capturing runtime traces in the DataDog library with DD_PROFILING_EXECUTION_TRACE_ENABLED=false env variable. I wonder if we could enable multiple consumers of the trace data at the same time.

Yeah, that's a limitation of the experiment which lives out-of-tree for convenience, but it technically shouldn't be necessary if it's implemented in the runtime. I'd hoped to allow one of each type of consumer (one active flight recorder, one active trace.Start).

rhysh commented

I wonder if we could enable multiple consumers of the trace data at the same time.

@mknyszek , we've discussed allowing the caller to control the set of events that the trace includes. That would enable runtime maintainers to use the execution trace to grant access to events that are niche and expensive, or in the other direction would enable low-overhead access to a subset of events.

Maybe this is how we build/improve delta profiles for #67942, and how we give access to goroutine tags/labels in monotonic profiles (heap, mutex, etc) for #23458.

If that's the direction we go, I think we'll want either a/ multiple trace.Start consumers (trace.(*Config).Start?), each with their own set of events or b/ a way to change the set of events without restarting the trace, so a user-space package like x/exp/trace can be the single subscriber on behalf of multiple others within the program.

Multiple trace.Start consumers isn't too hard to do (each new trace.Start registers another place for the trace reader goroutine to write to, and calls traceAdvance if the trace is already running to get a consistent starting point). But customizing each one is harder.

The only way I can imagine this working efficiently is via something like https://go.dev/cl/594595. TL;DR: We have some light infrastructure for emitting "experimental" events now, and this CL formalizes that a bit better by allowing us to split the event stream. It's at least a step toward that. If we have separate buffer streams, you could imagine a new registration consists of an io.Writer and a bitmask indicating which streams to accept data from (internally).

It's certainly not off the table, but I think we're a little ways off still.

It turns out that's quite difficult to do because Linux provides no opportunity to dump any information when an OOM occurs; the OOM killer simply SIGKILLs processes. Unfortunately GOMEMLIMIT also doesn't help, because GOGC=off GOMEMLIMIT=<value> is a valid configuration. In that configuration, the program is always at GOMEMLIMIT from the perspective of the operating system.

I'd like to point out that on Linux it's now possible to be actively informed by the kernel, via an epoll event, when a control group is under memory pressure, using the PSI subsystem.

https://systemd.io/MEMORY_PRESSURE/
https://docs.kernel.org/accounting/psi.html#monitoring-for-pressure-thresholds

This could be used as a signal to dump a trace and/or to trigger GC proactively.
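
A rough sketch of what such a trigger could look like on Linux follows. The cgroup path, thresholds, and overall wiring are illustrative; it writes a PSI trigger and then polls for POLLPRI using golang.org/x/sys/unix.

package main

import (
    "log"
    "os"

    "golang.org/x/sys/unix"
)

// waitForMemoryPressure blocks until the kernel reports that tasks in the
// cgroup were stalled on memory for at least 150ms within a 1s window.
func waitForMemoryPressure() error {
    f, err := os.OpenFile("/sys/fs/cgroup/memory.pressure", os.O_RDWR, 0)
    if err != nil {
        return err
    }
    defer f.Close()
    if _, err := f.WriteString("some 150000 1000000"); err != nil {
        return err
    }
    fds := []unix.PollFd{{Fd: int32(f.Fd()), Events: unix.POLLPRI}}
    for {
        n, err := unix.Poll(fds, -1)
        if err == unix.EINTR {
            continue
        }
        if err != nil {
            return err
        }
        if n > 0 && fds[0].Revents&unix.POLLPRI != 0 {
            return nil
        }
    }
}

func main() {
    if err := waitForMemoryPressure(); err != nil {
        log.Fatal(err)
    }
    // At this point one could snapshot the flight recorder and/or trigger a GC.
    log.Print("memory pressure threshold exceeded")
}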

rsc commented

This has been "waiting for implementation experience". Do we have enough experience to think about whether to add this to Go 1.24?

We've had some experience with flight recording inside Google, and we've also gotten some good feedback from @martin-sucha.

I think a few things are clear from our experiences so far:

  • The feature is useful. We've used it to peer at program behavior that was quite difficult to capture before.
  • Allowing at least one flight recorder and one trace.Start caller is crucial to smooth usage.
  • Control over the size of the recorder's window is also important, since a good size tends to be application-dependent and has a cost.
  • Rolling this out would likely be helped by net/http/pprof integration, which is going to be a little complex but possible. (But that likely needs its own proposal.)

Much of this can be refined and improved once we have an implementation that lives fully inside the runtime.

Lastly, from my point of view, I don't know what other feedback we would be waiting on at this point.

CC @cagedmantis

rsc commented

Have all remaining concerns about this proposal been addressed?

The API is at #63185 (comment) in the "Design" heading.

rhysh commented
// Adjustments to this value will not apply to an active flight recorder, and will not apply
// if tracing is already enabled via trace.Start. All tracing must be stopped and started
// again to change this value.

Those comments on SetPeriod and SetSize seem like they'd have odd implications. The docs sound like it's hard to commit to a behavior, which is unusual for APIs in std.

As a first quirk, assume an organization has site-wide use of a package that makes periodic calls to trace.Start/trace.Stop in the background. Then, assume there's a team that wishes to use flight recording for their single app. How can they know whether their SetPeriod/SetSize adjustments took effect? They can call trace.IsEnabled, but that won't say for sure what the status was at the instant they called trace.FlightRecorder.Start. Do they need explicit coordination with every other possible caller of trace.Start?

As for the two types of runtime/trace usage interfering in the other direction: How do the results of using the package-scoped tracer change in response to calling those methods on a FlightRecorder? Will short-term use of a flight recorder (which is Stopped and then discarded) have lasting results on the behavior of trace.Start later in the process's lifetime? What's the worst case here, from either very small or very large arguments to either of SetPeriod and SetSize? Are there benefits, such that an app owner might use the flight recorder as a way of configuring the package-scoped methods?

As a first quirk, assume an organization has site-wide use of a package that makes periodic calls to trace.Start/trace.Stop in the background. Then, assume there's a team that wishes to use flight recording for their single app. How can they know whether their SetPeriod/SetSize adjustments took effect? They can call trace.IsEnabled, but that won't say for sure what the status was at the instant they called trace.FlightRecorder.Start. Do they need explicit coordination with every other possible caller of trace.Start?

That is a very good point. This is something that hasn't come up in practice because it's not actually possible with the prototype to have both forms going simultaneously.

Relaxing the requirement of "all tracing must be stopped and started" is probably not too hard to add to the implementation. As long as there is only a single flight recorder, and therefore a single window size, we can apply that window size even when trace.Start is already active. I have updated the API docs to reflect this.

As for the two types of runtime/trace usage interfering in the other direction: How do the results of using the package-scoped tracer change in response to calling those methods on a FlightRecorder? Will short-term use of a flight recorder (which is Stopped and then discarded) have lasting results on the behavior of trace.Start later in the process's lifetime? What's the worst case here, from either very small or very large arguments to either of SetPeriod and SetSize? Are there benefits, such that an app owner might use the flight recorder as a way of configuring the package-scoped methods?

I see what you're getting at, and I think the sketch implementation described in the original post couples the SetPeriod/SetSize parameters with the rate at which traceAdvance is called a little too much.

One way to resolve this, to decouple SetPeriod/SetSize from anything observable with trace.Start, is to always have the tracer operate at some minimum granularity. That is, traceAdvance is always called, say, once per second, and SetPeriod/SetSize essentially always operate in window sizes that are multiples of that once-per-second behavior. SetPeriod/SetSize are already defined such that they can be overridden by implementation restrictions (they're really only hints, but important hints). What you get from this is the ability to scale SetPeriod/SetSize as high as you want, but there is an internal minimum (and maximum granularity) that is always respected.

(Note: There currently exists a runtime GODEBUG to control that internal minimum which is used for testing and (unsurprisingly) debugging. I don't know yet if it will continue to exist.)

rhysh commented

have updated the API docs to reflect this.

// Adjustments to this value will not apply to an active flight recorder.

Looks good to me, thanks.

to always have the tracer operate at some minimum granularity [...] say, once per second, and SetPeriod/SetSize essentially always operate in window sizes that are multiples of that once-per-second behavior.

I think that sounds good, though I'm not familiar enough with the mechanics to know what else that would imply.

For comparison, runtime/pprof's CPU profiler involves a goroutine waking every 100 ms, and that leads to a sometimes 200 ms delay in stopping those profiles. Is there potential for that style of delay here?

When the flight recorder is not in use, does it have an implication on the volume of data that the runtime would stockpile before handing off to the io.Writer that was passed to trace.Start?

Say we have a very busy app, where typical transactions are under 200 µs and the slow ones we want to find take 5 ms. Maybe it's something like Memcached/Redis, or MySQL/Postgres. If an operator configures the flight recorder for that app and requests a small memory limit (maybe equivalent to what the app would generate over 50 ms), would they end up with an entire "minimum granularity of 1 second" window resident in memory? I think that's what you're implying, and I think it's ok.

Maybe an experience report from an owner of an app like that would inform the decision on that window size or could help to justify leaving the GODEBUG override. (My own experience has been with apps where 1 second is long enough to see most problems, and short enough to not cause new ones.)

Is there potential for that style of delay here?

No. If tracing stops early, from either perspective, we can always call traceAdvance to immediately cut off the trace. I was mostly describing operation in a steady state, where you've got both the flight recorder and the tracer going continuously. Once either one asks for a cutoff, we can always cut off immediately. The flight recorder internally will always have to maintain one extra complete window, so this is fine -- the guarantees of the API will still be met.

When the flight recorder is not in use, does it have an implication on the volume of data that the runtime would stockpile before handing off to the io.Writer that was passed to trace.Start?

Nope! In theory, the buffers can be handed off immediately to the writer, the question just becomes "who frees the buffer." If the flight recorder is active, then it may still "hold a reference" to the buffer. (I imagine we'll need some gentle reference counting scheme, or something, to get this right, but it's doable.)

Say we have a very busy app, where typical transactions are under 200 µs and the slow ones we want to find take 5 ms. Maybe it's something like Memcached/Redis, or MySQL/Postgres. If an operator configures the flight recorder for that app and requests a small memory limit (maybe equivalent to what the app would generate over 50 ms), would they end up with an entire "minimum granularity of 1 second" window resident in memory? I think that's what you're implying, and I think it's ok.

I also think that's what I'm implying, but I'm getting a bit thrown off by "small memory limit". 😅 Do you mean a small argument to SetSize/SetPeriod? If so, then yes. If they configured the flight recorder to SetPeriod(50*time.Millisecond) the runtime would still always keep around 1 second of trace, provided it ran for at least 1 second. (There's an edge case where if you dump almost immediately after starting the flight recorder, we'll probably just dump whatever we have, rather than wait a full second.)

rsc commented

This discussion seems to be not relevant given the "only one tracer" limitation. I think we agree that if there are multiple tracers, then SetPeriod/SetSize apply to the one tracer they are called on, and not the others.

In discussion I was confused about "period" thinking it had something to do with sampling, which obviously it cannot. SetDuration might be a better name.

It should also be made clearer what happens if SetDuration/SetSize conflict. Does the bigger one win? And how does anyone know what kinds of sizes to pass? There is no documentation of the relationship between duration and bytes.

Maybe the API should be SetMaxDuration and SetMaxSize? Or SetMinDuration and SetMaxSize...? Or...?

Should they be dropped entirely?

Users need some way to understand how to use these, and it's unclear how to use them at the moment.

Or SetMinDuration and SetMaxSize...?

To expand on what this would mean: the buffer would be kept as small as possible to satisfy the min duration, but if that exceeds max size, then max size wins over the duration. That seems consistent with your comment about SetSize "limiting the damage" of SetDuration. If you just want as much trace as can fit in, say, 16 MiB, you'd SetMaxSize to 16 MiB and SetMinDuration to (effectively) infinity.

Should there be some way for the application to detect that the max size limit is capping the duration? A runtime metric tracking the current duration or size of the trace data in the buffers perhaps? Would such a metric be useful at all?

It should also be made clearer what happens if SetDuration/SetSize conflict. Does the bigger one win? And how does anyone know what kinds of sizes to pass? There is no documentation of the relationship between duration and bytes.

I'll just note that the original docs for SetSize do say:

// This generally takes precedence over the duration passed to SetPeriod.

(EDIT: That being said, all for clearer names and stricter semantics, personally.)

I support @aclements' description of the semantics for SetMaxSize and SetMinDuration from #63185 (comment), and also renaming SetSize and SetPeriod to SetMaxSize (maybe SetMaxWindowSize?) and SetMinDuration (maybe SetMinWindowDuration?). If we do add "Window" it would be in service of describing more precisely which property of the flight recorder we're mutating. The flight recorder itself doesn't really have a "size" or "duration" property, but the trace window it maintains does.

One caveat I would like to add is that SetMaxSize and SetMinDuration should be clearly documented as best-effort. Making either of these hard requirements, I suspect, would lead to a lot of unnecessary complexity, and possibly some difficult performance edge cases. For example, if we're about to exceed the max size, do we stop the world for everything except the trace reader goroutine? By making these best-effort, we still get most of the utility without all of that.

I would also like to mention in the documentation that there is an internal minimum for the minimum duration, and that the setting might be ignored. If the minimum is too low, a simple implementation (call traceAdvance more frequently) will result in some serious and unexpected CPU overheads.

Finally, I agree we should add some guidance on how to set max size and min duration to the documentation. For min duration, how about:

A smaller minimum duration means lower memory overheads in exchange for less trace data. A good minimum duration setting is slightly longer than the duration of the longest event you want to capture in the trace, that is, the event that would trigger a call to WriteTo. For example, if you are debugging RPC timeouts in a web service, WriteTo is called in response to a timeout, and the timeout is 30 seconds, then a minimum duration of 35 seconds would be appropriate.

For max size:

A smaller max size means lower memory overheads in exchange for less trace data. The purpose of max size is to place a soft memory limit on the amount of trace data held in memory at once, superseding the minimum duration, in case a large amount of trace data is produced in a short span of time. This value should be set according to the tolerances of your program's environment. For example, if your program uses 50 MiB at peak and operates within a 100 MiB container, then anywhere between 5-45 MiB would be appropriate. A good rule of thumb for calculating this limit is that average Go programs produce GOMAXPROCS MiB of trace data per second, though programs may produce more or less, depending on what they do.
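
As a concrete illustration of that rule of thumb (not part of the proposed documentation; the window length and the flight recorder value fr are assumptions for the example):

// Roughly GOMAXPROCS MiB of trace data per second, so a ~10-second window
// on a GOMAXPROCS=8 machine suggests a soft cap of about 80 MiB.
window := 10 * time.Second
maxBytes := uint64(runtime.GOMAXPROCS(0)) * uint64(window/time.Second) << 20
fr.SetMaxBytes(maxBytes) // or SetMaxSize, depending on which name lands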

If this sounds OK overall, I'll workshop it a bit more and put all this back up into the original proposal.

CC @cagedmantis

To expand on what this would mean: the buffer would be kept as small as possible to satisfy the min duration, but if that exceeds max size, then max size wins over the duration. That seems consistent with your comment about SetSize "limiting the damage" of SetDuration. If you just want as much trace as can fit in, say, 16 MiB, you'd SetMaxSize to 16 MiB and SetMinDuration to (effectively) infinity.

Based on this description of the duration limit and the one in #63185 (comment), it seems like we're describing a maximum duration. I.e. the flight recorder will hold on to no longer than the given duration of recent activity. FWIW I don't think a minimum duration makes sense for a flight recorder. If we have a minimum duration the buffer should hold, surely holding more than that duration would be acceptable? Meaning we'd have an unlimited flight recording if a size limit is not set?

That said, if I replace "minimum duration" with "maximum duration" then the descriptions and guidance in #63185 (comment) seem reasonable to me. In a continuous profiling context, we would likely use both. We'd have a duration limit of ~1 minute since that's how often we upload data, and it'd be a waste to upload substantially overlapping traces. And we'd have a size limit to prevent OOMing user programs, and keep bandwidth & storage costs under control.

For some prior art here, the Java JDK Flight Recorder (JFR) has maxage and maxsize parameters. These apply to the on-disk repository of flight recorder data, rather than the in-memory buffers, but the idea is the same: drop data older than maxage or the oldest data when the size of the repository exceeds maxsize. If both limits are set, whichever limit is met first wins. JFR doesn't set either limit by default, but I believe it's less of a firehose than the Go execution tracer, and it's going to disk while only keeping a relatively small amount buffered in memory.

If we have a minimum duration the buffer should hold, surely holding more than that duration would be acceptable? Meaning we'd have an unlimited flight recording if a size limit is not set?

I see your point, but I can still see the argument both ways. MinDuration makes sense to me because if you ask for a duration, you'd expect a time window of at least that size to be visible in the resulting trace. MaxDuration meanwhile feels a bit like the runtime is likely to give you less. (It can give you less than the parameter asks for if, for example, the program just hasn't been executing long enough.)

The window size is, IMO, a bit more of a target than it is a strict minimum or maximum. It's much fuzzier to me than MaxSize. We could call it SetTargetDuration, maybe?

I think where I got confused with Austin's comment was the relationship between duration and size in bytes. Rereading it, I think Austin is suggesting that SetMinDuration would be a hint to right-size the buffer. Like, given a target duration the flight recorder would fix a buffer size of N bytes that it thinks should fit that duration, and try not to ever use much more than N bytes. But as you noted in your suggested guidance for SetMaxSize, there's not a fixed relationship between duration and size. A busy HTTP server with lots of active connections will produce a lot more trace data than an idle one over the same duration. So I don't think SetMinDuration could work that way.

MinDuration makes sense to me because if you ask for a duration, you'd expect that at least this size of time window is visible in the resulting trace. MaxDuration meanwhile feels a bit like the runtime is likely to give you less. (It can give you less than the parameter asks for, if, for example, the program just hasn't been executing long enough.)

The window size is, IMO, a bit more of a target than it is a strict minimum or maximum. It's much fuzzier to me than MaxSize. We could call it SetTargetDuration, maybe?

I had pictured a MaxDuration(duration) implementation working by having the flight recorder hold on to all trace generations which ended less than duration ago, and drop generations older than that. This is what JFR basically does with the maxage parameter. So, it would not keep much more than the last duration of activity. But it would also keep at least the last duration of activity, so it kind of satisfies what you would want out of MinDuration(duration) while still being bounded. In that sense, yeah, it is more of a target. Is that the kind of limit/implementation you had in mind?

I had pictured a MaxDuration(duration) implementation working by having the flight recorder hold on to all trace generations which ended less than duration ago, and drop generations older than that.

The original implementation I had in mind would always keep 2 generations around (the currently-accumulating one and the previous one), but adjust the rate at which generations are produced based on the duration and size parameters. It would track how many bytes live in buffers and force the creation of a new generation in response to MaxSize. Generally speaking, I suspect this will be easier to implement. This is different from the experimental flight recorder in golang.org/x/exp/trace, which behaves more like you describe.

One downside is that this implementation strategy has visible effects on concurrent calls to trace.Start, which is a negative. They're mostly inconsequential, except that it raises the memory requirement for parsing traces. In theory MaxSize will offset any such issues.

Your proposed implementation is also reasonable, and it might be helpful to decouple the internal generation policy from the flight recorder policy. But it also means the duration and size parameters will be less precise, always rounded up to the minimum granularity the runtime exposes.

I believe it would be backwards compatible (since the wire format doesn't follow compatibility guarantees) to change the implementation if one or the other is a mistake, so we can experiment and start with whatever's easier.

The original implementation I had in mind would always keep 2 generations around (the currently-accumulating one and the previous one), but adjust the rate at which generations are produced based on the duration and size parameters.

Does this mean that during the WriteTo call (if the write takes long enough to fill the active generation to the full size), the approximate amount of memory used can be 2 * MaxSize? Should the guidance about setting MaxSize take this case into account?

@martin-sucha No. In my mind, MaxSize would simply take into account all outstanding trace data, and once that exceeded the threshold, a new generation would be produced (allowing the oldest one to be scrapped). And this would supersede the duration.

In the diagnostics meeting today, we discussed SetMinDuration and SetMaxSize a little bit, and also how it interacts with trace.Start. Given this comment from @rsc, which I do agree with:

I think we agree that if there are multiple tracers, then SetPeriod/SetSize apply to the one tracer they are called on, and not the others.

I think the API in this proposal is not quite right. Having different configurations for different subscribers of trace data is going to be very complex to implement in the general case. It's going to be tricky for duration and size, but even worse if we consider things like opting into specific events in the future (filtering the event stream effectively requires parsing it, which is much slower than emitting it).

I would like to propose an alternative API that makes these knobs, MinDuration and MaxSize global properties, and describes how they interact with other trace subscribers. So, rather than define a new configuration type called FlightRecorder, I propose we define a new Trace type. This type then sets one single global configuration but can have multiple subscribers. Here's a sketch:

package trace // runtime/trace

// Trace represents a configuration for the tracer.
type Trace struct {
    // unexported fields
}

// SetSyncPeriod controls the time between trace synchronization points which become split
// points in the trace. As a result, it also controls the window size for the flight recorder. The
// flight recorder endeavors to always retain one full sync period worth of trace data in addition
// to the data being accumulated for the current sync period.
//
// This setting is best-effort and approximate, and in most cases the implementation-defined
// default is good enough. 
//
// Note: Put another way, this controls the period between EventSync events that appear in the
// trace, as defined by golang.org/x/exp/trace.
//
// TODO: Provide more guidance on how to set this.
func (t *Trace) SetSyncPeriod(dur time.Duration)

// SetSyncPeriodSizeLimit sets a limit on how much trace data can be accumulated in a sync
// period, forcing the creation of a new synchronization point when the limit is exceeded, even
// if the period has not reached the duration set by SetSyncPeriod.
//
// TODO: Provide more guidance on how to set this.
func (t *Trace) SetSyncPeriodSizeLimit(bytes int64)

func (t *Trace) StartFlightRecorder() // Formerly FlightRecorder.Start
func (t *Trace) StopFlightRecorder() // Formerly FlightRecorder.Stop
func (t *Trace) WriteFlightRecording(io.Writer) (n int64, err error) // Formerly FlightRecorder.WriteTo.

// Subscribe registers a subscriber that begins ingesting trace data and writing it to w.
// If tracing is not yet enabled when Subscribe is called, then Subscribe will enable tracing.
//
// Only one instance of Trace may enable tracing at a time, so Subscribe will fail and return
// an error if another instance of Trace already has subscribers.
func (t *Trace) Subscribe(w io.Writer) (Subscription, error)

type Subscription struct {
    // unexported fields
}

// Close deregisters a subscriber ingesting trace data. If this is the last subscriber and flight
// recording is disabled, then tracing will be disabled.
func (s Subscription) Close()

This API is substantially different, but it gets at the heart of what the size and duration parameters really represent, and makes them global settings, so there's no conflict between subscribers. It also presents coherent semantics for these settings with respect to regular tracing. It also aligns even more closely with the existing proposal for CPU profiling configuration (#42502). Lastly, it allows for multiple trace subscribers (that is, the equivalent of multiple calls to trace.Start), which is outside the scope of this proposal, but I think it's worth considering all these things together when trying to decide on a configuration API.

My one concern with this API is that the sync period is arguably an implementation detail, but I think we've reached a point where it can no longer be treated as one. Without the concept of a synchronization point in the trace, many important properties of the new trace format simply cannot be relied upon (like the fact that large traces can be parsed using a reasonable amount of memory). While most of the time users do not need to be aware of this at all, for those working with and integrating with traces deeply, these details matter a lot.

If this API sketch is of interest to those on this issue, I will file a separate proposal to define:

type Trace struct {
    // unexported fields
}

func (t *Trace) Subscribe(w io.Writer) (Subscription, error)

type Subscription struct {
    // unexported fields
}

func (s Subscription) Close()

Extending the tracer for multiple subscribers is actually not that hard to implement. The trace reader goroutine just needs to spit trace data out into multiple io.Writers, effectively acting as a MultiWriter itself. The advantage of the subscription mechanism is that different tracing consumers can come and go concurrently at different times. This means that net/http/pprof need not return an error if someone tries to grab a trace manually while, say, a diagnostics suite occasionally performs continuous monitoring duties by enabling tracing briefly. But, we can also cut down this API to allow only one subscriber. That would be OK with me too.

rhysh commented

This means that net/http/pprof need not return an error if someone tries to grab a trace manually while, say, a diagnostics suite occasionally performs continuous monitoring duties by enabling tracing briefly.

// Only one instance of Trace may enable tracing at a time, so Subscribe will fail and return
// an error if another instance of Trace already has subscribers.

How does net/http/pprof end up sharing the same *trace.Trace value that's in use by the continuous-monitoring diagnostics suite?


// Note: Put another way, this controls the period between EventSync events that appear in the
// trace, as defined by golang.org/x/exp/trace.

I don't have a feel for the cost of EventSync events. What's the difference between syncing every 1000 ms, 100 ms, or 10 ms? If "10 ms" is so great, why don't we do that unconditionally?

average Go programs produce GOMAXPROCS MiB of trace data per second

This sounds helpful. We expect to have an OS thread for each GOMAXPROCS, and for each OS thread to have a pretty big system stack. (Is that true for all threads, or only for the first one?) Can we convert that into a justification for a general-purpose default for the sync interval (like "if we sync every 100 ms, that's 10% more memory usage than you'd get from the OS threads' stacks themselves, even if the app didn't have its own memory needs")?

Maybe I still don't understand the root we're trying to get to. It seems like every consumer of the execution trace data stream can make its own decisions about when to discard data (because it's more bytes than the consumer has in its budget, or because it covers more seconds that the consumer has been told it needs to record). But the one thing that the consumers need to agree on is how often the runtime will generate the sync events: that's the source of the need for some global config (or mutual exclusion). Is that right?

How does net/http/pprof end up sharing the same *trace.Trace value that's in use by the continuous-monitoring diagnostics suite?

Yeah, that's a good point; I don't have a good answer. We could have net/http/pprof expose a *trace.Trace I suppose. I also wonder how that works for the CPUProfile configuration.

// Note: Put another way, this controls the period between EventSync events that appear in the
// trace, as defined by golang.org/x/exp/trace.

I don't have a feel for the cost of EventSync events. What's the difference between syncing every 1000 ms, 100 ms, or 10 ms? If "10 ms" is so great, why don't we do that unconditionally?

You raise a good point. The documentation should state this, but yes, there's a CPU cost to generating synchronization events more frequently. The cost is roughly proportional to the number of Ms and Gs. Off the top of my head, I don't know what the cost is, but it's certainly measurable.

This sounds helpful. We expect to have an OS thread for each GOMAXPROCS, and for each OS thread to have a pretty big system stack. (Is that true for all threads, or only for the first one?) Can we convert that into a justification for a general-purpose default for the sync interval (like "if we sync every 100 ms, that's 10% more memory usage than you'd get from the OS threads' stacks themselves, even if the app didn't have its own memory needs")?

The cost of synchronization events is proportional to not only OS threads, but the number of goroutines (since we record information about even waiting goroutines too). I'm not totally sure I follow your "10% more memory" example though; can you explain more?

Maybe I still don't understand the root we're trying to get to. It seems like every consumer of the execution trace data stream can make its own decisions about when to discard data (because it's more bytes than the consumer has in its budget, or because it covers more seconds that the consumer has been told it needs to record). But the one thing that the consumers need to agree on is how often the runtime will generate the sync events: that's the source of the need for some global config (or mutual exclusion). Is that right?

Yeah, that's mostly right.

I think in a hypothetical future scenario where we want to optionally expose "expensive" events (like, say, an allocation trace, which is currently behind a GODEBUG), then I think we also want all consumers to agree to opting into that. In this particular example, the cost will be global, and it will be substantial, and filtering out these "expensive" events will be even more expensive. Of course, as you say, consumers can do downstream filtering instead. But I think there's some value to controlling filtering (or really, adding new events) upstream. Again, this is all hypothetical at the moment, but I think it's a positive that the new API plays nicely with this possible future scenario.

I propose we define a new Trace type. This type then sets one single global configuration

How does one create a Trace? Just new(trace.Trace)?

If it has a global nature, why not make the methods package-level functions, or make a global variable of Trace? I guess one reason could be that some code creates a *Trace (say t = new(trace.Trace)), other unrelated code in the same program does not have access to t, so it cannot do anything with it, like changing its configurations? Is this the intention?

And, how does the unrelated code know if there is already an instance of Trace with subscribers, so it won't try to create another one which will be useless anyway? Will the package-level IsEnabled function return true in this case?

+1 to making the global nature of *Trace more explicit as suggested by @cherrymui.

I don't have a feel for the cost of EventSync events. What's the difference between syncing every 1000 ms, 100 ms, or 10 ms? If "10 ms" is so great, why don't we do that unconditionally?

@rhysh in addition to the CPU costs mentioned in @mknyszek's reply, I think it's also important to consider the size overhead. More EventSync events means more data wasted on duplicating strings and stack traces across different generations.

@mknyszek I have a question about WriteFlightRecording. Can you clarify what happens when calling this method? Does it write out the previous generation AND the current generation that is in progress? Does it finish (sync) the current generation?

+1 to making the global nature of *Trace more explicit as suggested by @cherrymui.

Cherry and I spoke offline and I agree, I think the new API I wrote above is not quite right. Apologies for the confusion.

I think where we landed was that we should continue with the API as-proposed in the original post, but that the implementation's decisions on when to cut a new generation should be decoupled from the flight recorder parameters (size and duration).

That being said, we can consider having a separate top-level API that controls the size and duration between sync events. Like: (This is definitely incomplete.)

package trace

// SetSyncRate sets the rate at which the trace implementation resynchronizes the internal tracer state.
// The rate is expressed as a duration and a size in bytes. The tracer is synchronized each time the period
// elapses, or the amount of trace data produced since the last sync exceeds size bytes, whichever comes
// first.
//
// More frequent synchronization results in a bigger overall trace (because of more duplicate trace data),
// higher run-time CPU overheads when tracing is enabled, and lower run-time memory overheads. Less
// frequent synchronization results in the opposite.
func SetSyncRate(period time.Duration, sizeBytes int64)

I think the original sin that spawned this whole discussion was conflating the trace implementation's size/duration with the flight recorder's size/duration. It's very convenient from an implementation perspective to couple the flight recorder's size/duration with the trace's underlying size/duration, but I think it ultimately doesn't make much sense from an API perspective.

I don't have a feel for the cost of EventSync events. What's the difference between syncing every 1000 ms, 100 ms, or 10 ms? If "10 ms" is so great, why don't we do that unconditionally?

@rhysh in addition to the CPU costs mentioned in @mknyszek's reply, I think it's also important to consider the size overhead. More EventSync events means more data wasted on duplicating strings and stack traces across different generations.

I think this question really gets at the heart of why this should be decoupled from flight recording. With flight recording, you care predominantly about how much trace data you're keeping around, and you want to control that. If we want to have a knob for the performance trade-off, we should have a separate knob for that.

@mknyszek I have a question about WriteFlightRecording. Can you clarify what happens when calling this method? Does it write out the previous generation AND the current generation that is in progress? Does it finish (sync) the current generation?

Yes. In both the original proposal and the (now defunct) new one, a request to dump flight recorder data will always finish the current generation to ensure that the most recent data ends up in the trace that is produced.

During the diagnostic sync on Thursday I mentioned an API idea for ensuring that two libraries trying to control a global runtime setting don't end up interfering with each other in a problematic way. This might be less of a concern here, but I'm also thinking ahead to future APIs for controlling which trace events to emit or suppress.

Below is a potential API pattern that could work:

// SyncRate holds the rate at which the trace implementation resynchronizes the internal tracer state.
// The rate is expressed as a duration and a size in bytes. The tracer is synchronized each time the period
// elapses, or the amount of trace data produced since the last sync exceeds size bytes, whichever comes
// first.
//
// More frequent synchronization results in a bigger overall trace (because of more duplicate trace data),
// higher run-time CPU overheads when tracing is enabled, and lower run-time memory overheads. Less
// frequent synchronization results in the opposite.
type SyncRate struct {
	Period    time.Duration
	SizeBytes int64
}

// ControlSyncRate gives the caller control over the sync rate until release is called. Only one caller can
// control the sync rate at a time. If another caller tries to control the sync rate, an error is returned.
func ControlSyncRate() (control func(SyncRate), release func(), err error)

// ReadSyncRate returns the current sync rate.
func ReadSyncRate() SyncRate

In the case that ControlSyncRate returns an error, we could also include the stack trace of the caller that currently has control over the rate, making it very easy to diagnose potential configuration conflicts.

I'd also like to include ReadSyncRate for two reasons: a) It's useful for tests that want to set a rate and then restore the previous one. b) It allows libraries to better adjust their behavior in case they are not able to control a setting.
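
To sketch how a library might use this pattern (hypothetical; the names match the API proposed above, and adaptToRate is made up for illustration):

package mylib

import (
	"log"
	"runtime/trace"
	"time"
)

// tuneSyncRate tries to take control of the global sync rate. If some other
// package already controls it, log the conflict and adapt to the current
// setting instead of fighting over it.
func tuneSyncRate() (release func()) {
	control, release, err := trace.ControlSyncRate()
	if err != nil {
		log.Printf("sync rate already controlled elsewhere: %v", err)
		adaptToRate(trace.ReadSyncRate()) // hypothetical helper
		return func() {}
	}
	control(trace.SyncRate{Period: 500 * time.Millisecond, SizeBytes: 8 << 20})
	return release
}

func adaptToRate(r trace.SyncRate) { /* adjust library behavior; illustrative */ }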

Let me know what you think. 🙇

Change https://go.dev/cl/555716 mentions this issue: gopls/internal/debug: integrate flight recorder

Apologies for the very long delay here.

Since a lot of the remaining questions here touch on our overall strategy and treatment of diagnostics, and how they interact with multiple consumers, I brought together a few people internally to discuss. We can use these points as starting points in the next diagnostics weekly meeting. Here's what we're thinking.

  • It was a mistake that diagnostics are global and single-consumer. Overall, we want to move toward a world where all diagnostics support multiple consumers where possible.
  • At the moment, we're thinking that consumers state their minimum requirements of a diagnostic API, and the diagnostic internals meet all minimums. This means that if, in some hypothetical future, we have options on a trace to support different kinds of events, we will always collect the union of the sets of events requested by everyone, and everyone gets a copy of the union set.
  • There are certain cases where minimum requirements aren't good enough. For example, a CPU profile can only have one constant rate in the pprof format. So, we need to do some work to make sure everyone gets the rate they expected, but internally the profiling rate would be the maximum of all rates.
  • We definitely want user control over flight recorder buffer size, independent of runtime details. This will be what the flight recorder API exposes. We have a choice as to whether or not this participates in the "minimum requirements" landscape described above (that is, you get the minimum requirements requested of all flight recorders). (Note: "minimum requirements" doesn't necessarily mean an actual minimum, just the most stringent requirements.)

Given all of this, I have made some minor updates to the API at the top of this proposal based on the feedback provided in this issue, and I think we should move forward with that.

@felixge With respect to the sync rate knob, I'm currently thinking trace sync rate should probably be a separate discussion and a separate proposal. (That isn't to say we won't follow up, just that it doesn't need to block this proposal.) It's also not totally clear to me that we should have "Period" in there at all. It's about controlling CPU and memory overheads, so it should control them more directly. Given the above, I think it would be good to fold this into the "minimum requirements" idea instead of designating one goroutine and/or package as the knob owner. I am not totally sure what this would look like yet, but I'll give it some thought. We can also discuss this next week.

I'll add this to the diagnostics meeting agenda for next week and I'll report back on what we land on.

@nsrip-dd I thought more about MaxDuration vs. MinDuration. I really do see your point, but I updated the proposal at the top to use MinAge instead of MaxAge. Here's my reasoning.

If we think of MinAge and MaxBytes as minimum requirements, then logically it works out clearer with this naming scheme. Across all flight recorders, we would take the maximum MinAge, and the minimum MaxBytes, as these are the most stringent requirements, and apply these to the implementation.

Also, in the context of a user trying to decide what to do, I think MinAge makes a bit more sense to me. If I'm a user that wants to debug an RPC timeout, and the timeout is 10 seconds, then I might want to make sure that the snapshot contains at least 15 seconds of data. So the oldest event is at least 15 seconds old.

That being said, I recognize these points are somewhat marginal, but we need to make a decision one way or the other, and this is enough to tip the scales for me.
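
As a concrete sketch of that scenario against the setter-based API at the top of this proposal (the 15-second value is just the example above):

package main

import (
	"runtime/trace"
	"time"
)

func main() {
	// The RPC timeout is 10 seconds, so ask for a window whose oldest event
	// is at least ~15 seconds old when a snapshot is taken.
	fr := trace.NewFlightRecorder()
	fr.SetMinAge(15 * time.Second)
	// Start, the timeout trigger, and WriteTo are elided from this sketch.
}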

// Adjustments to this value will not apply to an active flight recorder.
func (*FlightRecorder) SetMinAge(d time.Duration)

Should it return an error in this case, then? Or we're okay with just a silent no-op?

rsc commented

If we are worried about setters being called after Start, one option would be to define a separate FlightConfig struct that has the fields we need and then have NewFlightRecorder take a *FlightConfig.

Sorry for not posting this sooner, but to follow up on my last message: we reached consensus on the API a couple of diagnostics meetings ago.

Should it return an error in this case, then? Or we're okay with just a silent no-op?

That's a good point. The parameter is kind of a hint anyway, so I figured a silent no-op was OK. However...

If we are worried about setters being called after Start, one option would be to define a separate FlightConfig struct that has the fields we need and then have NewFlightRecorder take a *FlightConfig.

Yeah, this is something we could certainly do, and just make the parameters immutable. It's a little bit awkward since the zero value becomes meaningful, whereas with the setters we get to say the default is implementation-defined. Unless we say the zero value of any parameter indicates that the caller just wants the implementation default.

It occurs to me that #42502 (comment), which is an accepted proposal, has its setter return an error, as @cherrymui suggested. So maybe we should align with that? There's also no constructor for the type.

There's also no constructor for the type.

We probably should have a constructor function, so it is clear how to create a FlightRecorder. Or we should document that new(FlightRecorder) just works.

I generally dislike having "typestate" in an API, so I agree with the direction of separating configuration and running.

Unless we say the zero value of any parameter indicates that the caller just wants the implementation default.

I think this would be fine.

So this would look something like

package trace

type FlightRecorderConfig struct {
	// MinAge is a lower bound on the age of an event in the flight recorder's window.
	//
	// The flight recorder will strive to promptly discard events older than the minimum age,
	// but older events may appear in the window snapshot. The age setting will always be
	// overridden by MaxSize.
	//
	// If this is 0, the minimum age is implementation defined, but can be assumed to be on the order
	// of seconds.
	MinAge time.Duration

	// MaxSize is an upper bound on the size of the window in bytes.
	//
	// This setting takes precedence over MinAge.
	// However, it does not make any guarantees on the size of the data WriteTo will write,
	// nor does it guarantee memory overheads will always stay below MaxBytes. Treat it
	// as a hint.
	//
	// If this is 0, the maximum size is implementation defined.
	MaxBytes uint64
}

type FlightRecorder struct {
    ...
}

// NewFlightRecorder creates a new flight recorder.
func NewFlightRecorder(config FlightRecorderConfig) *FlightRecorder

func (*FlightRecorder) Start() error
func (*FlightRecorder) Stop() error
func (*FlightRecorder) Enabled() bool
func (*FlightRecorder) WriteTo(w io.Writer) (n int64, err error)
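
Under the zero-value-means-default reading discussed above, a caller that doesn't want to tune anything could then write (sketch only):

package main

import (
	"log"
	"runtime/trace"
)

func main() {
	// A zero FlightRecorderConfig asks for the implementation-defined
	// defaults for both MinAge and MaxBytes.
	fr := trace.NewFlightRecorder(trace.FlightRecorderConfig{})
	if err := fr.Start(); err != nil {
		log.Fatal(err)
	}
	defer fr.Stop()
	// ... run the program; call fr.WriteTo when something interesting happens ...
}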

@nsrip-dd and I spent some time going over the latest proposal again. It looks good to us, but we'd like a bit more color on the operation of WriteTo():

  • Does it cut off an ongoing generation immediately, or does it wait until it finishes? We'd prefer the former as it would make flight recording more useful for the "something bad happened and we might be crashing soon" use case.
  • What happens when calling WriteTo() on the same flight recorder twice? Do the two recordings overlap? And is that behavior the same when multiple flight recorders are used?
  • Does it cut off an ongoing generation immediately, or does it wait until it finishes? We'd prefer the former as it would make flight recording more useful for the "something bad happened and we might be crashing soon" use case.

Yes, it cuts off an on-going generation immediately so you have up-to-date data. I don't think that's necessary to document, really, since I think that's the most intuitive and useful behavior it could have. Though I guess we could say something like "best-effort up-to-date data as of when WriteTo returns" so nobody has to second-guess it.

  • What happens when calling WriteTo() on the same flight recorder twice? Do the two recordings overlap? And is that behavior the same when multiple flight recorders are used?

Calling multiple WriteTo's from different flight recorders is fine in general; the only problem is that they have to queue behind each other to cut a new generation, though we could be a bit smarter about that. For example, if traceAdvance is just about to cut a new generation, then both calls can count that as a recent-enough cut. But I think if traceAdvance is past the cut-off point (updating trace.gen), then one should probably wait for the other, because it's more likely to be stale. (This is kind of how concurrent calls to runtime.GC work.)

Multiple WriteTo's are currently disallowed on the same FlightRecorder (it's a returned error) for two reasons.
1. I think of a single FlightRecorder as "targeting" some kind of specific event. So returning an error allows deduplicating multiple WriteTo's covering the same time period for the same FlightRecorder (take the first one). (I'm not sure how useful this is, but it seems useful to not accidentally generate N very similar dumps all at the same time.)
2. To avoid complicated synchronization on the FlightRecorder's internals.
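
For illustration, under those semantics a trigger can treat the error as benign and simply skip the redundant snapshot (the snapshot helper below is an assumption, not part of the proposal):

package snapshots

import (
	"log"
	"os"
	"runtime/trace"
)

// snapshot writes the flight recorder's window to the given file. If another
// WriteTo is already in progress on the same FlightRecorder (the error case
// described above), skip this snapshot rather than produce a near-duplicate
// dump of the same time window.
func snapshot(fr *trace.FlightRecorder, path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Print(err)
		return
	}
	defer f.Close()
	if _, err := fr.WriteTo(f); err != nil {
		log.Printf("skipping snapshot: %v", err)
	}
}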

@mknyszek , would you mind posting the latest version of the full proposed API as a self-contained comment? Thanks!

@mknyszek , would you mind posting the latest version of the full proposed API as a self-contained comment? Thanks!

Bump 😊

package trace

// FlightRecorder represents a single consumer of a Go execution
// trace.
// It tracks a moving window over the execution trace produced by
// the runtime, always containing the most recent trace data.
//
// At most one flight recorder may be active at any given time,
// though flight recording is allowed to be concurrently active
// with a trace consumer using trace.Start.
// This restriction of only a single flight recorder may be removed
// in the future.
type FlightRecorder struct {
    ...
}

// NewFlightRecorder creates a new flight recorder from the provided configuration.
func NewFlightRecorder(cfg FlightRecorderConfig) *FlightRecorder

// Start activates the flight recorder and begins recording trace data.
// Currently, only one flight recorder and one call to trace.Start may be
// active simultaneously.
// This restriction may be lifted in the future.
// Returns an error if Start is called multiple times on the same
// FlightRecorder, or if Start would cause there to be more
// simultaneously active trace consumers than is currently supported.
func (*FlightRecorder) Start() error

// Stop ends recording of trace data. It blocks until any concurrent WriteTo calls complete.
func (*FlightRecorder) Stop()

// Enabled returns true if the flight recorder is active.
// Specifically, it will return true if Start did not return an error, and Stop has not yet been called.
// It is safe to call from multiple goroutines simultaneously.
func (*FlightRecorder) Enabled() bool

// WriteTo snapshots the moving window tracked by the flight recorder.
// The snapshot is expected to contain data that is up-to-date as of when WriteTo is called,
// though this is not a hard guarantee.
// Only one goroutine may execute WriteTo at a time. 
// An error is returned upon failure to write to w, if another WriteTo call is already in-progress,
// or if the flight recorder is inactive.
func (*FlightRecorder) WriteTo(w io.Writer) (n int64, err error)

type FlightRecorderConfig struct {
	// MinAge is a lower bound on the age of an event in the flight recorder's window.
	//
	// The flight recorder will strive to promptly discard events older than the minimum age,
	// but older events may appear in the window snapshot. The age setting will always be
	// overridden by MaxSize.
	//
	// If this is 0, the minimum age is implementation defined, but can be assumed to be on the order
	// of seconds.
	MinAge time.Duration

	// MaxBytes is an upper bound on the size of the window in bytes.
	//
	// This setting takes precedence over MinAge.
	// However, it does not make any guarantees on the size of the data WriteTo will write,
	// nor does it guarantee memory overheads will always stay below MaxBytes. Treat it
	// as a hint.
	//
	// If this is 0, the maximum size is implementation defined.
	MaxBytes uint64
}
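
To make the intended workflow concrete, here's a minimal usage sketch against the API above; the HTTP handler, the 500 ms threshold, and the output file name are illustrative assumptions, not part of the proposal:

package main

import (
	"log"
	"net/http"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	fr := trace.NewFlightRecorder(trace.FlightRecorderConfig{
		MinAge:   10 * time.Second, // keep at least ~10s of history
		MaxBytes: 8 << 20,          // hint: cap the window around 8 MiB
	})
	if err := fr.Start(); err != nil {
		log.Fatal(err)
	}
	defer fr.Stop()

	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// ... handle the request ...
		if time.Since(start) > 500*time.Millisecond {
			// Something interesting happened: snapshot the window.
			f, err := os.Create("slow-request.trace")
			if err != nil {
				log.Print(err)
				return
			}
			defer f.Close()
			if _, err := fr.WriteTo(f); err != nil {
				log.Print(err) // e.g. another WriteTo already in progress
			}
		}
	})
	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}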

Looks good, thanks! Just a nitpick, there is MaxSize in the doc comment, but MaxBytes in the field name.

I wonder what a user would do with the error from Start and Stop? Could "redundant" calls just be ignored, avoiding the need to return an error? Users can always check if it is running with Enabled.

Users can always check if it is running with Enabled.

Only if they synchronize between all goroutines that might wish to call Start, else you have a race condition.

I missed the latest revision of the API, in which Enabled is a method on a specific instance of the flight recorder.

// Each FlightRecorder should represent some specific behavior
// that the caller is trying to capture about their program.
// For instance, recording a trace of a slow RPC response from a
// specific endpoint.

I'm not sure what this means. Given that you snapshot a flight recorder after something unexpected has happened, it needs to already be running, so how can you have a flight recorder for each behavior?

// Start activates the flight recorder and begins consuming trace data.

"Consuming" seems like an implementation detail. "Recording"?

I wonder what a user would do with the error from Start and Stop? Could "redundant" calls just be ignored, avoiding the need to return an error? Users can always check if it is running with Enabled.

I could imagine other error categories from Start but I'm not sure what you do with an error from Stop. Maybe it could just be a silent no-op if the recorder isn't started?

I'm not sure what this means. Given that you snapshot a flight recorder after something unexpected has happened, it needs to already be running, so how can you have a flight recorder for each behavior?

I think there's a misunderstanding here. It was just supposed to be some advice on how to actually use it, but it imagines a world where you can have multiple independent FlightRecorder instances all enabled. When some exceptional behavior occurs, you have to decide which FlightRecorder to call WriteTo on, and the idea is that you have one FlightRecorder instance per unexpected behavior you want to track, so that they compose. This way, independent packages don't have to think about finding the single global FlightRecorder to call WriteTo on, but rather all create their own FlightRecorder. (Under the hood, all these independent FlightRecorder instances could share a ring buffer whose parameters are maximally permissive across all instances.)

Anyway, such a world will not exist to begin with, so I'll remove that from the API doc.

"Consuming" seems like an implementation detail. "Recording"?

Sure, will update.

I could imagine other error categories from Start but I'm not sure what you do with an error from Stop. Maybe it could just be a silent no-op if the recorder isn't started?

I agree. I'll remove the error.

I have updated #63185 (comment) to include the latest feedback.

I wonder what a user would do with the error from Start and Stop? Could "redundant" calls just be ignored, avoiding the need to return an error? Users can always check if it is running with Enabled.

For Stop, I agree. For Start, I think it's better to notify when that fails ASAP. WriteTo can fail instead, or we can ask people to check Enabled, but that feels more awkward than just letting callers know immediately that this flight recorder won't work. (There are also other errors that can occur in Start, such as calling Start twice on the same FlightRecorder. We could be loose about that, but I don't see any particular advantage to doing so. Requiring matching Start and Stop calls feels less messy for everyone. I updated the doc to mention it.)

In Start, the documentation is currently pretty strict about it returning an error if a flight recorder is already active. Since we may want to lift that restriction in the future, let's be less strict about that wording.

I loosened up the language in the proposed API docs for Start. It's a little clunky but I think it gets the point across, and we can workshop it in review.

A minor point: would it be better for FlightRecorderConfig.MaxBytes to be uintptr? The value shouldn't be larger than the address space. On the other hand, uint64 matches WriteTo.

Based on the discussion above, this proposal seems like a likely accept.
— aclements for the proposal review group

The proposal is to add support for a trace flight recorder to the trace package:

package trace

// FlightRecorder represents a single consumer of a Go execution
// trace.
// It tracks a moving window over the execution trace produced by
// the runtime, always containing the most recent trace data.
//
// At most one flight recorder may be active at any given time,
// though flight recording is allowed to be concurrently active
// with a trace consumer using trace.Start.
// This restriction of only a single flight recorder may be removed
// in the future.
type FlightRecorder struct {
    ...
}

// NewFlightRecorder creates a new flight recorder from the provided configuration.
func NewFlightRecorder(cfg FlightRecorderConfig) *FlightRecorder

// Start activates the flight recorder and begins recording trace data.
// Only one call to trace.Start may be active at any given time.
// In addition, currently only one flight recorder may be active in the program.
// Returns an error if the flight recorder cannot be started or is already started.
func (*FlightRecorder) Start() error

// Stop ends recording of trace data. It blocks until any concurrent WriteTo calls complete.
func (*FlightRecorder) Stop()

// Enabled returns true if the flight recorder is active.
// Specifically, it will return true if Start did not return an error, and Stop has not yet been called.
// It is safe to call from multiple goroutines simultaneously.
func (*FlightRecorder) Enabled() bool

// WriteTo snapshots the moving window tracked by the flight recorder.
// The snapshot is expected to contain data that is up-to-date as of when WriteTo is called,
// though this is not a hard guarantee.
// Only one goroutine may execute WriteTo at a time. 
// An error is returned upon failure to write to w, if another WriteTo call is already in-progress,
// or if the flight recorder is inactive.
func (*FlightRecorder) WriteTo(w io.Writer) (n int64, err error)

type FlightRecorderConfig struct {
	// MinAge is a lower bound on the age of an event in the flight recorder's window.
	//
	// The flight recorder will strive to promptly discard events older than the minimum age,
	// but older events may appear in the window snapshot. The age setting will always be
	// overridden by MaxSize.
	//
	// If this is 0, the minimum age is implementation defined, but can be assumed to be on the order
	// of seconds.
	MinAge time.Duration

	// MaxBytes is an upper bound on the size of the window in bytes.
	//
	// This setting takes precedence over MinAge.
	// However, it does not make any guarantees on the size of the data WriteTo will write,
	// nor does it guarantee memory overheads will always stay below MaxBytes. Treat it
	// as a hint.
	//
	// If this is 0, the maximum size is implementation defined.
	MaxBytes uint64
}

A minor point: would it be better for FlightRecorderConfig.MaxBytes to be uintptr? The value shouldn't be larger than the address space. On the other hand, uint64 matches WriteTo.

My feeling is that we tend to use uint64 or int64 to mean bytes, which is why I picked this. Then again, when used in some interfaces, it needs to be larger than the address space because IIUC you can have a >4 GiB file on disk on 32-bit systems.

And of course you're right, this is totally in-memory. int or uint is a little friendlier than uintptr and works similarly today.

I don't feel very strongly about this either way, since I don't have a strong sense of what the norms are. Signed integers are a little annoying because then we have to say what negative means (probably also "implementation defined"), but we use int everywhere already and it's not a big deal.

I don't feel strongly about it either. uint64 is probably fine.

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— aclements for the proposal review group

The proposal is to add support for a trace flight recorder to the trace package:

package trace

// FlightRecorder represents a single consumer of a Go execution
// trace.
// It tracks a moving window over the execution trace produced by
// the runtime, always containing the most recent trace data.
//
// At most one flight recorder may be active at any given time,
// though flight recording is allowed to be concurrently active
// with a trace consumer using trace.Start.
// This restriction of only a single flight recorder may be removed
// in the future.
type FlightRecorder struct {
    ...
}

// NewFlightRecorder creates a new flight recorder from the provided configuration.
func NewFlightRecorder(cfg FlightRecorderConfig) *FlightRecorder

// Start activates the flight recorder and begins recording trace data.
// Only one call to trace.Start may be active at any given time.
// In addition, currently only one flight recorder may be active in the program.
// Returns an error if the flight recorder cannot be started or is already started.
func (*FlightRecorder) Start() error

// Stop ends recording of trace data. It blocks until any concurrent WriteTo calls complete.
func (*FlightRecorder) Stop()

// Enabled returns true if the flight recorder is active.
// Specifically, it will return true if Start did not return an error, and Stop has not yet been called.
// It is safe to call from multiple goroutines simultaneously.
func (*FlightRecorder) Enabled() bool

// WriteTo snapshots the moving window tracked by the flight recorder.
// The snapshot is expected to contain data that is up-to-date as of when WriteTo is called,
// though this is not a hard guarantee.
// Only one goroutine may execute WriteTo at a time. 
// An error is returned upon failure to write to w, if another WriteTo call is already in-progress,
// or if the flight recorder is inactive.
func (*FlightRecorder) WriteTo(w io.Writer) (n int64, err error)

type FlightRecorderConfig struct {
	// MinAge is a lower bound on the age of an event in the flight recorder's window.
	//
	// The flight recorder will strive to promptly discard events older than the minimum age,
	// but older events may appear in the window snapshot. The age setting will always be
	// overridden by MaxSize.
	//
	// If this is 0, the minimum age is implementation defined, but can be assumed to be on the order
	// of seconds.
	MinAge time.Duration

	// MaxBytes is an upper bound on the size of the window in bytes.
	//
	// This setting takes precedence over MinAge.
	// However, it does not make any guarantees on the size of the data WriteTo will write,
	// nor does it guarantee memory overheads will always stay below MaxBytes. Treat it
	// as a hint.
	//
	// If this is 0, the maximum size is implementation defined.
	MaxBytes uint64
}

Change https://go.dev/cl/674995 mentions this issue: runtime/trace: modifies how batches are read and processed

Change https://go.dev/cl/673117 mentions this issue: runtime/trace: copy the flight recorder implementation over

Change https://go.dev/cl/673118 mentions this issue: runtime/trace: add the ability to subscribe to tracing

Change https://go.dev/cl/673116 mentions this issue: runtime/trace: add the flight recorder

Change https://go.dev/cl/675297 mentions this issue: runtime/trace: stop reserializing the batches

With 83df0af, the exp/trace test TestFlightRecorderConcurrentWriteTo (https://go.googlesource.com/exp/+/refs/heads/master/trace/flightrecorder_test.go#91) now deadlocks in traceAdvance.

Change https://go.dev/cl/677355 mentions this issue: trace: regenerate to include CL 673116

Change https://go.dev/cl/677262 mentions this issue: cmd/trace: add -exitstdin flag

Change https://go.dev/cl/677695 mentions this issue: internal/trace: expose the go version read by the reader

Change https://go.dev/cl/677976 mentions this issue: _content/doc: add release note for the flight recorder