nodejs/diagnostics

Node CPU Profiling Roadmap

yunong opened this issue · 79 comments

We’d like to get feedback on the status of the profiling tools landscape in Node.js today. In particular, we want to get alignment on a roadmap that will provide a free, open-source, cross-platform set of tools that are part of the Node/V8 API (i.e. maintained across LTS versions) and that together form a powerful suite for debugging and diagnosing Node.js issues in production.

Production Challenges

There are challenges unique to debugging and diagnosing issues in production, especially for large, critical deployments. In particular, the production nature of these deployments imposes the following constraints:

  1. Generally, most solutions need to be low impact on the process. Attaching debuggers or tracers is often impractical when the application is taking production traffic. Debuggers will pause the process, causing all inflight requests to hang. Tracers and debuggers often introduce large performance penalties, which can impact customers.
  2. Tools need to be safe and not cause kernel panics or process crashes.
  3. It’s often hard to reproduce issues seen at production scale -- therefore programmatic access to tooling via APIs can help diagnose problems as they occur, e.g. starting a profile based on some condition, since the issue often can’t be reproduced after the fact.

Profiling

One of the most useful methodologies for optimizing CPU performance in a running application is to sample the CPU stack frames (CPU profiling) and then visualize the samples, typically as a flamegraph. This technique shows hot code paths on CPU -- which gives you the opportunity to optimize the relevant source code.

This can be done in production with OS-level profilers (perf, DTrace, SystemTap, eBPF) at very low overhead. On their own, these profilers lack the information needed to resolve JS frames, leaving unhelpful memory addresses in place of the JS frames. V8 solves this problem by dumping a mapping from JIT-generated code addresses to JS source file and line number.

It’s important to mention here that having access to all stack frames, whether native (v8, libc, syscalls, libuv, native modules) or JS is important. Problems can occur anywhere in the stack, and we want to be able to profile Node with complete stack frames. E.g. We heavily use gRPC -- which is a native module -- so without access to native frames we would not be able to get visibility into this critical part of our application.

There are a few issues with this implementation:

  1. perf(1) support is now deprecated in V8 and will not be supported starting in Node 8 -- which effectively means we’re losing the ability to profile JS stacks.
  2. Lack of cross-platform tooling. Even if perf support wasn’t deprecated, this solution only works on Linux.

We’d like to contribute and collaborate on a set of comprehensive, cross-platform, and open source CPU profiling tools with the Node and V8 team. The V8 team has advised us that they plan to support the v8 profiler and the v8 cpu profiler API going forward, and we want to unlock CPU profiling capabilities in Node using these supported frameworks.

Roadmap:

  1. Determine which tools -- v8 profiler (--prof), cpu profiler or something else -- to support and maintain for profiling Node applications.
  2. Currently, these tools do not support native C/C++ frames, which includes v8, native-module, libuv, and syscall stack frames. Note that because these tools run in-process, they will never be able to show syscalls, as those can only be sampled from within the kernel. Supporting native frames will require implementing a stackwalker in the v8 profiler or the cpu profiler.
  3. Determine the overhead of profiling using these tools. With the OS-level profilers (whose V8 support is, per the above, deprecated), the overhead was quite low -- and thus suitable for use in a production environment without outsized impact on customers.
  4. Dynamically toggle profiling in a production environment. It’s not practical to profile all the time in production due to the performance impact -- being able to dynamically enable and disable profiling via an API would be ideal.
  5. Provide an API to programmatically consume profiling data once enabled.
  6. Add documentation so these tools can be adopted by users.

We’re looking for feedback and alignment with the community on this subject before proceeding with the design and implementation -- please let us know your thoughts.

Great explanation and summary of the current scenario.

Suggestion: We could add "Write tests for those profiling tools" to the roadmap, as I think one of the goals is to have these tools as first-class citizens in the future.

Another suggestion: a better standardized format for the profiles, and tools that convert legacy formats to it. The current V8 CPU profile format is quite limiting: since it's a single JSON document, profiles cannot easily be concatenated and the whole profile has to be held in memory before it can be serialized.
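For context, the profile returned by the inspector protocol's Profiler.stop (the .cpuprofile format) looks roughly like the sketch below -- one JSON document holding the entire call tree plus every sample, which is why it can't be streamed or concatenated easily. Field values here are made up purely for illustration:

{
  "nodes": [
    { "id": 1, "callFrame": { "functionName": "(root)", "scriptId": "0", "url": "", "lineNumber": -1, "columnNumber": -1 }, "hitCount": 0, "children": [2] },
    { "id": 2, "callFrame": { "functionName": "handleRequest", "scriptId": "42", "url": "file:///app/server.js", "lineNumber": 10, "columnNumber": 2 }, "hitCount": 25 }
  ],
  "startTime": 0,
  "endTime": 5000000,
  "samples": [2, 2, 1],
  "timeDeltas": [1000, 1000, 1000]
}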

In 0x (flamegraph tool) I've re-implemented stack tracing on top of the v8 profiler (--prof) and the v8 tick processor (--prof-process --preprocess) – see davidmarkclements/0x#99. It's currently behind a flag, but in the next major version this will be the default, with a --kernel-tracing flag for OS-level tracing.

Pros

  • no need for sudo
  • totally cross platform - runs on anything that Node does
  • zero config to make it work in vms and containers
  • single thing to maintain (if I wasn't keeping kernel tracing support)
  • works around the problem with kernel tracing and TurboFan - namely that interpreted functions (functions which do not get optimized) are opaque behind BytecodeHandlers (the trace output will have a series of bytecode-related stack frames instead of the actual function the bytecode is handling). V8 prof does the memory address lookup to resolve these stack frames. This is considered a V8 bug https://bugs.chromium.org/p/v8/issues/detail?id=7155 so hopefully that will be fixed in future... but we still have the whole Node 8 line to date without it.

Cons

  • no I/O stacks or third-party C/C++ – if there's a hot path in libuv or a native C/C++ module, it's not going to show
  • higher overhead
  • more difficult if you want realtime processing - while the isolate logfile is written to incrementally, you can't fully control the name of the file (isolate-ADDR-v8.log where ADDR is determined internally) and it's incomplete JSON until the process finishes.
  • the output format isn't great - the best way to consume it is the v8 tick processor, but that's an all-at-once operation, again blocking incremental processing
  • platform/configuration/hardware dependent timestamps - the prof output uses TimeTicks::HighResolutionNow, an internal v8 function whose output differs between systems. In a particular configuration of OS X, for instance, the TimeTicks::HighResolutionNow output is based on a count of ticks (it just adds one for each tick) – this makes it incredibly difficult to reliably cross-reference stacks captured by --prof with stacks captured by anything else. Not only that, but how do you match the sample rate so that you get one-to-one on the samples?

What I would love is the ability to cross-reference prof stacks with OS-level tracing stacks. If we can do that, then the perf deprecation and the bytecode handler problem (and any other problems) would be resolved by --prof, while lower-level tracing could be used to complement --prof.

In terms of Node internal implementation, absolutely agree on tests.

The way we currently expose --prof-process isn't ideal: the stack processor is concatenated JavaScript built from some JS files in deps/v8/tools, with additional injected code to make the log output work with Node (see nodejs/node#14966).

We recently added production-grade continuous sampling to our product. For that we had to find a sweet spot between sampling frequency and duration. Additionally, we discovered a memory leak in v8::CpuProfiler (nodejs/node#14894).
Here is a blog post with screen shots https://www.dynatrace.com/news/blog/introducing-code-level-visibility-for-node-js/

As already mentioned, this approach won't give you native frames - still we considered it to be sufficient for many use cases.

@Hollerberg who implemented it will be at the summit and we are happy to discuss the validity of our approach and be part of an initiative to improve CPU sampling capabilities in node.

We use the CPU profiler in the profiling agent here, which is an experimental way to continuously collect profiling data from production Node.js apps. Some notes on that experience so far:

  • The testing and better overhead guarantees from the runtime would be very much desired. "How much will it cost my prod?" is a common question and it's hard to answer that without the implementation backing that claim and testing protecting it.
  • Heap profiling is no less important than CPU profiling, so it would be good to discuss that as well, maybe separately. The heap state is interesting in production both for investigating occasional OOMs and for understanding the causes of any excessive pressure on the heap (which in GC languages turns into wasted CPU cycles).
  • As a format, we use the proto-based profile.proto format a lot, including in the open-source pprof tool, so that would be one possible candidate for a unified format.
  • For the native stacks, this would be good, yes. What is also desired is that the profiler is capable of profiling the CPU time of all threads in the process. The profiler we use now only samples the Node thread and when that thread is doing nothing, it's hard to say whether it does nothing because there is no work to do or because it's waiting for something in libuv to finish. Having at least some visibility into what happens in libuv would be highly desired and profiling all threads in the process is one way to go.
  • Regarding perf support - OS profilers are nice but there are types of virtualized environments where perf hardware events are not available, so the benefits of OS-level profiling are smaller there. Also things like heap profiling are only available within the process as tight cooperation with the runtime is necessary.
  • Regarding being able to turn the profiling on and off, this is actually possible today via the StartProfiling / StopProfiling interface (see the sketch below).
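As a minimal sketch of what that looks like from JavaScript, using Node's inspector module (which drives the same v8::CpuProfiler under the hood); error handling omitted:

const inspector = require('inspector');
const fs = require('fs');

const session = new inspector.Session();
session.connect();

// profile the current process for `ms` milliseconds, then hand the profile to `done`
function profileFor(ms, done) {
  session.post('Profiler.enable', () => {
    session.post('Profiler.start', () => {
      setTimeout(() => {
        session.post('Profiler.stop', (err, result) => {
          session.post('Profiler.disable');
          done(err, result && result.profile);
        });
      }, ms);
    });
  });
}

// e.g. trigger a 5-second profile on demand and write it to disk
profileFor(5000, (err, profile) => {
  if (!err) fs.writeFileSync('./on-demand.cpuprofile', JSON.stringify(profile));
});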

I believe it is important that Node.js have first-class tooling for problem investigation and CPU profiling is part of that along with tracing, debugging, heap dumps, node-report and being able to introspect core files. Ideally, a base level of core tooling would be available with the runtime without having to install additional components as that can often be an issue in production.

That's a long way of saying that I support your effort and will look to see how I can help with getting CI testing and other supporting pieces in place, and also how I might be able to pull in people from IBM to help collaborate.

For reference: #150

Another issue that @bmeurer mentioned to me is that there is no way for Linux perf or any other perf tool outside of V8 to discern an inlined function from its caller, since both occupy a single stack frame. The CpuProfiler doesn't provide this either, but @FranziH is addressing this.

Updates from the Diagnostics Summit

External profilers

External profilers are working well on Node 6, even though they are not officially supported. They also work on Node 8+, but the information they collect can be misleading due to the introduction of TurboFan and Ignition. Also, the current method used by most external profilers to resolve JITed functions (--perf-basic-prof) is not officially supported by V8 and might break or be removed in the future.

To support external profilers in the future, we need two things:

Interpreted Frames

After the introduction of Turbofan and Ignition, Interpreted Frames on the stack don’t reflect JavaScript function calls since only the interpreter appears in the stack, as we can see in the image below.

image

As a consequence, understanding the data collected by external profilers can be tricky, and the data can be misleading since there's no way to distinguish between different JavaScript function calls when they are running in interpreted mode. As soon as those functions are compiled and optimized by Turbofan, they will appear in the stack as before.

During the Summit, we came up with three different approaches to have more meaningful information on the call stack for external profilers. All of them must be implemented on the V8 interpreter, and they basically change the current flow to add a unique stack frame for each JS function.

Intermediate function before calling the interpreter

Add an intermediate frame to the stack which points to a JIT-function with the purpose of keeping track of which JS function is being called.

image

Duplicate the interpreter code for each JS function

Copy the InterpreterEntryTrampoline code for each interpreted function; this way each interpreted frame will have a unique entry point. Apparently, ChakraCore is implemented this way.

image

Change the stack on runtime to replace the interpreter with a unique address representing the JS function

Hack the call stack at runtime, replacing InterpreterEntryTrampoline's return address with the address to a small JIT function which will return to InterpreterEntryTrampoline later.

image

API with information to resolve JIT function addresses

External profilers can’t resolve names for JITed functions (including V8 Builtins) without help from the runtime. Also, most of the time those names are resolved with post-processing tools. Today we have --perf-basic-prof, which is used by a variety of tools to post-process the output from profilers and extract useful information from that.

As --perf-basic-prof is very limited and is not officially supported by V8, the suggestion is to create an API which listens to code creation events and exposes all the information needed by external tools to resolve names for JITed functions.
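No such API exists today; purely for illustration, a JavaScript-facing version of it might look something like the sketch below (the module name and event shape are hypothetical):

// HYPOTHETICAL sketch -- 'v8-code-events' and the event fields below are not a real API
const codeEvents = require('v8-code-events');

codeEvents.on('code-created', (event) => {
  // the minimum an external profiler needs to resolve a JITed frame:
  //   event.name    - JS function (or builtin/stub) name
  //   event.address - start address of the generated machine code
  //   event.size    - size of the generated code in bytes
  // e.g. append an "ADDR SIZE name" line, the format perf(1) reads from /tmp/perf-<pid>.map
  console.log(`${event.address.toString(16)} ${event.size.toString(16)} ${event.name}`);
});

codeEvents.on('code-moved', (event) => {
  // emitted when the GC relocates generated code; tools must update their address maps
});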

V8 Profiler & CpuProfiler

V8 builtin profilers are officially supported, but they can only sample JavaScript frames. We discussed the possibility to add native frames to those profilers as well in the future. Some challenges are 1) sampling frames from other threads; 2) sampling syscalls frames.

Roadmap

Related issue: #150

Thanks for the great write up of the discussion from the summit.

OP said:

perf(1) support is now deprecated in V8 and will not be supported starting in Node 8 -- which effectively means we’re losing the ability to profile JS stacks.

AFAIK, perf(1) support is not deprecated. Like any of the other profiling APIs, it doesn't come with an official support expectation from the V8 team. I do not think this perf support is going anywhere anytime soon.

@ofrobots the decision to deprecate perf(1) support came from discussions with @bmeurer, @hashseed, and @fhinkel. If this has changed then great -- but I'd like to just ensure we're all aligned.

Let me try to rephrase this: perf-based tools are not officially supported because they were never intended for public use. As a result, they did not get any attention when we launched TurboFan and Ignition in V8 5.8.

That being said, they worked fine for a long time before 5.8 and if fixed, I expect that they will stay functional for a long time. I don't expect major changes like TurboFan/Ignition to occur often.

In addition to what @hashseed said:

I expect that they will stay functional for a long time

The time span here is in years since a new compiler/interpreter won't replace Turbofan/Ignition overnight. Also, it won't happen by surprise: we knew Turbofan and Ignition were coming years before they landed on Node.js core.

I don't expect major changes like TurboFan/Ignition to occur often.

I believe when this happens it will take the same upgrade path as in the past: the new compiler/interpreter is introduced in the codebase, and the old one is replaced in steps until it can be removed. That gives us time to help make the new compiler/interpreter work well with external profilers (especially if we want to provide Tier 1 support for external profilers in Node.js, which I think we should).

@hashseed so is the recommended path forward from the V8 team to use the perf-based tools as opposed to creating a new API? If so, is there some level of commitment to keeping them working (along the lines of what we have now with V8 running Node.js tests)?

I recommend using the CpuProfiler, since this is the one that both Chrome DevTools and GCP Stackdriver use. But I can see that the current feature set makes it unattractive in some use cases.

Perf-based tools are not designed with public use in mind, and I cannot make a hard commitment to support them in the long run. What I want to avoid is being in the situation of having to block changes to V8 because they might break perf-based profiling in some way. I don't expect this to happen soon or often though. Also, the breakage we are talking about, introduced by Ignition, concerns a capability that, as far as I hear, no other popular interpreted language runtime supports either.

That being said, we currently don't have any test case for perf-based profiling at this point, again due to the nature of them being ad-hoc tools not designed for public use. I would welcome test cases in Node.js so that we could at least notice when they break and make informed decisions about whether to fix them, if the effort to do so is reasonable. The current situation is that we could not honestly claim to offer official support even if we wanted to.

CPU Profiling deep dive WG meeting

Time

UTC Thu 15-Mar-2018 19:00 (07:00 PM):

Timezone Date/Time
US / Pacific Thu 15-Mar-2018 12:00 (12:00 PM)
US / Mountain Thu 15-Mar-2018 13:00 (01:00 PM)
US / Central Thu 15-Mar-2018 14:00 (02:00 PM)
US / Eastern Thu 15-Mar-2018 15:00 (03:00 PM)
London Thu 15-Mar-2018 19:00 (07:00 PM)
Amsterdam Thu 15-Mar-2018 20:00 (08:00 PM)
Moscow Thu 15-Mar-2018 22:00 (10:00 PM)
Chennai Fri 16-Mar-2018 00:30 (12:30 AM)
Hangzhou Fri 16-Mar-2018 03:00 (03:00 AM)
Tokyo Fri 16-Mar-2018 04:00 (04:00 AM)
Sydney Fri 16-Mar-2018 06:00 (06:00 AM)

Agenda

  • Explaining implementation [optional, 5-10m]
  • Demo solution [optional, 5m]
  • Integration with Node: testing, support, release, and timeline for these
  • Testing use-cases
  • What to do when it breaks due to a V8 update
  • CodeEventListener: API, integration, and timeline
  • Future plans (extending CPUProfiler?) [only if we have time]

@hekike I expect this deep dive is focusing only on external profiling (e.g. via perf,..) - not in-process profiling (e.g. using V8 CpuProfiler APIs).

I'm travelling so I won't be able to make that time. Any chance we could do it Thurs/Friday instead?

@hekike - a little confusion in your time table - date says Mar 07 not Mar 14. Also, please note that this sunday (3/11) is when we move clocks forward for DST in US & Canada, so we'll be UTC -07:00. This will affect some times above.

@Flarna CpuProfiler APIs is an optional future topic, I'm not sure we will have time for it, so I think your expectation is correct.

@mhdawson I would be okay with either Thursday or Friday. Would it be okay for you @mmarchini and for at least one person from the V8 team? (cc @hashseed @fhinkel)

@mike-kaufman sry, my bad, I used the table from the last WG meeting, how can I generate such a nice markdown timetable?

I think I can join. With the DST change, what time are we talking about here?

Both Thursday and Friday work for me. I can give an overview of the current situation at the start of the meeting (challenges, proposed implementation & demo).

I think 7 pm UTC is 8 pm Berlin, 3 pm East Coast. Works for me this Wednesday and Thursday.

how can I generate such a nice markdown timetable?

@hekike - see here

Friday @ 7pm UTC should work for me as well.

Call me old-fashioned, but I strongly prefer Thursday evening over Friday evening :)

Okay, as I see it, Thursday works for everyone and it's the preferred option.

UTC Thu 15-Mar-2018 19:00 (07:00 PM):

Timezone Date/Time
US / Pacific Thu 15-Mar-2018 12:00 (12:00 PM)
US / Mountain Thu 15-Mar-2018 13:00 (01:00 PM)
US / Central Thu 15-Mar-2018 14:00 (02:00 PM)
US / Eastern Thu 15-Mar-2018 15:00 (03:00 PM)
London Thu 15-Mar-2018 19:00 (07:00 PM)
Amsterdam Thu 15-Mar-2018 20:00 (08:00 PM)
Moscow Thu 15-Mar-2018 22:00 (10:00 PM)
Chennai Fri 16-Mar-2018 00:30 (12:30 AM)
Hangzhou Fri 16-Mar-2018 03:00 (03:00 AM)
Tokyo Fri 16-Mar-2018 04:00 (04:00 AM)
Sydney Fri 16-Mar-2018 06:00 (06:00 AM)

Upstream CL to make interpreted frames distinguishable by external profilers: v8:959081

@hekike - is there a link for the deep-dive hangout?

edit - nm, I see this is now on 3/15.

Not yet. I'll share it here tomorrow before the meeting.

Reminder: The meeting will start in a few minutes.

I played with the idea of introducing native frames to --prof and v8::CpuProfiler, but had a few setbacks (especially with v8::CpuProfiler). A draft of the implementation can be found here: nodejs/node-v8@f6d9549

It works quite well with --prof because its output format will have all the sampled stack addresses for each tick, along with the information necessary to resolve V8-generated symbols and with information about linked shared libraries to resolve native symbols. Since the output of --prof is processed after the sampling is over, resolving native symbols is quite easy.

On the other hand, making it work with the v8::CpuProfiler is not trivial, because v8::CpuProfiler resolves symbols at execution time instead of lazily like --prof or Linux perf. It also drops any address it can't resolve, and since V8 has no knowledge of native symbol names, it will drop all native frames, even though it uses the same backend (v8::SafeStackIterator) as --prof. To solve this, we need to either allow lazily resolving symbols when necessary or teach V8 to resolve C++ symbols.

The first approach is probably more performant, but it requires changing the output format of the v8::CpuProfiler to also include information about unresolved frames (and maybe about linked shared libraries), so we need to know if there's interest on the V8 side to introduce something like that. We should probably align with the Chrome DevTools team as well since they use the .cpuprofile format.

The second approach would require us to either implement an ELF reader (and an equivalent on Windows) inside V8 or call an external program (like nm) to resolve native symbols. We could use libelf for that, but I'm worried about portability issues here.

Honestly, I don't know which approach we should take, or even if it's a good idea to introduce native frames to v8::CpuProfiler. Ideas are appreciated. @hashseed @fhinkel @ofrobots WDYT?

@mmarchini if we could turn on and off --prof at runtime, I think it would be very nice.

@mcollina I agree, but I think --prof is not officially supported.

@hashseed

Another issue that @bmeurer mentioned to me is that there is no way for Linux perf and any other perf tool outside of V8 to discern inlined functions from its caller, since both occupy a single stack frame. The CpuProfiler doesn't provide this either, but @FranziH is addressing this.

The link you provided (https://bugs.chromium.org/p/v8/issues/detail?id=7203) is not public. Is there a way to make this public?

Thanks!

@mmarchini Thank you! FWIW, and for others hitting the same issue: the link above is broken, the correct link is https://bugs.chromium.org/p/v8/issues/detail?id=7203.

@mmarchini - can this be closed now?

@mike-kaufman let me double check first. Might be worth keeping it open (or opening another issue) to discuss improvements to V8 CpuProfiler, but I don't think we need to keep it in the agenda anymore.

should this remain open? [ I am trying to chase dormant issues to closure ]

Was this discussion ever resolved? It's unclear from reading this thread whether there is a current recommendation and/or roadmap for production profiling.

In my own tests, I found the following:

  • CpuProfiler, and by extension inspector and pprof, blocks the event loop every time it's started for several milliseconds. Since the only way to get profiles at runtime is to start/stop, it makes it unsuitable for production use. It also doesn't include native frames, meaning its usefulness is limited.
  • --prof only outputs when the process is stopped, making it unsuitable for production.
  • perf seems to be the only viable solution right now, but it's not officially supported and only available on linux.

Is there still a plan to have a production-ready profiler available for Node?

Since the only way to get profiles at runtime is to start/stop

That's incorrect, both the C++ CpuProfiler API and the CpuProfiler inspector domain allow you to take profiles on demand (with no need to stop the process). This is not well documented though, so I understand the confusion, and that's something we need to improve. The overhead is supposedly higher than perf, but still acceptable. I haven't seen a noticeable overhead when using it (at least not on current Node.js versions). Some folks here have expressed concerns wrt performance of V8 CpuProfiler before, but if that's still an issue we need reproducible code to share those concerns with V8, otherwise the situation will not improve.

--prof only outputs when the process is stopped, making it unsuitable for production.

Agreed, prof is unsuitable for production since you can't start it on demand, but it's worth noting that V8 does not support this flag officially.

perf seems to be the only viable solution right now, but it's not officially supported and only available on linux.

You're right, perf is the only system profiler we have examples of how to use today, and yes, it's not officially supported by V8, but we take a best-effort approach to keep it working (it's been working since v10 LTS, and since then we were able to fix most issues in V8 before they reached Node.js). I know some folks at Microsoft have successfully used the --interpreted-frames-native-stack flag with a Windows profiler to profile V8, so it should also be possible to do so on Windows and other operating systems, but someone interested in these platforms needs to do the work to ensure it works and is tested (just like it happened with Linux).

Is there still a plan to have a production-ready profiler available for Node?

The biggest problem is that any solution for a "production-ready profiler" needs to be implemented and supported by V8. The current recommendation from V8 is to use the inspector protocol CpuProfiler. For folks who need native frames as well, I generally recommend trying perf, with the caveats you mentioned.

With that being said, it's worth noting that languages like Python don't have an officially supported production-ready profiler which also shows native frames. Java doesn't have an officially supported one either, but perf is available there with a similar support level and similar caveats to what we have on Node.js/V8. Native-ish languages like Go and Rust might have better support (since they are closer to native code than JIT and interpreted languages), but I don't have enough knowledge of those environments.

That's incorrect, both the C++ CpuProfiler API and the CpuProfiler inspector domain allow you to take profiles on demand (with no need to stop the process)

I meant you have to stop the profiler to get the profile, not the process, sorry for the confusion. The main overhead in my tests came from v8::CpuProfiler::StartProfiling blocking the event loop. While over say 30 seconds this may disappear completely in the average, it still can cause a massive spike for any requests happening while StartProfiling is running.

Since the only way to capture a profile is to call StopProfiling, it means after capturing the profile that the profiler needs to be started again, causing another spike.

but if that's still an issue we need reproducible code to share those concerns with V8

Simply measuring the call to StartProfiling gives me a good 15-20ms. The max latency when using autocannon with a Hello World goes from 10-30ms to 200-400ms.

The biggest problem is that any solution for a "production-ready profiler" needs to be implemented and supported by V8.

Yes and this is my main concern since the profiler provided by V8 has a performance impact and doesn't support native frames.

Anyway, thanks a lot for the explanation! (and your work on node-linux-perf :)

I meant you have to stop the profiler to get the profile, not the process, sorry for the confusion.

I see. Do you see another use case for getting the result without stopping besides avoiding the start overhead? I'm wondering if it makes sense to propose a new API if we can fix the startup overhead.

The main overhead in my tests came from v8::CpuProfiler::StartProfiling blocking the event loop.

Yes, I heard that before but I haven't seen it in practice yet (but I didn't measure it closely either).

Simply measuring the call to StartProfiling gives me a good 15-20ms. The max latency when using autocannon with a Hello World goes from 10-30ms to 200-400ms.

Will do, thanks, this is super helpful :). How are you calling StartProfiling, via C++ or inspector protocol (or which module are you using)?

Maybe I'm wrong, but wouldn't using the inspector protocol mitigate the startup overhead, since it's a separate thread (and not impacting the event loop)?

Maybe of interest: nodejs/node#23070 (comment)
I have not retried with up to date node version but I assume not that much has changed since that time.

We have seen a high RSS overhead during StartProfiling and, as far as I remember, it was also blocking the event loop for quite a long time. The reason was that for each existing function some info must be copied into the profiler thread during StartProfiling. Profiling itself is efficient as it runs on a different thread.
The main issue for us was that we could not predict nor limit how big the overhead is, and the result was that the OOM killer on some cloud platforms terminated the apps.

For a user who fully knows their app and controls when the profiler runs this is maybe not an issue, because they can adapt their setup. But for a monitoring tool it's quite a pain that apps stop working just because the tool is added.

I see. Do you see another use case for getting the result without stopping besides avoiding the start overhead? I'm wondering if it makes sense to propose a new API if we can fix the startup overhead.

Not really. Although I thought it was a bit weird to have to restart the profiler to get the current profile, I was fine with that API until I realized it causes an overhead, meaning you may not want to do it continuously in production.

Will do, thanks, this is super helpful :). How are you calling StartProfiling, via C++ or inspector protocol (or which module are you using)?

I tried both: with the inspector protocol and with pprof that provides a native binding.

Maybe i'm wrong but wouldn't using the inspector protocol would mitigate the startup overhead since it's a seperate thread (and not impacting the event loop) ?

@vmarchaud That was my understanding as well, but it doesn't seem to be the case since in my tests the impact of the inspector was exactly the same as with a binding to v8::CpuProfiler.

For a user which fully knows his app and controls when profiler runs this is maybe no issue because the user can adapt it's setup. But for a monitoring tool it's quite a pain that apps stop working just because the tool is added.

While I agree with the general statement that it's less of an issue for a single app, it still can be an issue if your service is expected to answer very quickly all the time, as starting the profiler could impact the app a lot for the time when it's blocking. Maybe doing the copying of existing functions only once would fix the issue completely? Something like CpuProfiler::prepare or otherwise just adding a way to get the profile without stopping.

Maybe doing the copying of existing functions only once would fix the issue completely? Something like CpuProfiler::prepare or otherwise just adding a way to get the profile without stopping.

V8 is always moving functions around and generating new code objects, so even though this is possible it would only cover a part of the problem.

We have seen a high RSS overhead during StartProfiling and as far as I remember it was also blocking the even loop quite long. Reason was that for each existing function some info must be copied into the profiler thread during StartProfiling. Profiling itself is efficient as it is in a different thread.

I'm still a bit confused how this is a problem on V8 CPU Profiler but not on https://github.com/mmarchini/node-linux-perf, since we do the same thing on that module.

(well, not confused about memory overhead, the module just spits functions to a file, which should be slower tbh)

Main issue for us was that we could not predict nor limit how big the overhead is and the result were that OOM killer on some cloud platform terminated the apps.

This is concerning and seems like a bug on V8 IMO. @Flarna do you remember which version you tested last? I remember someone from V8 did quite a bit of work to improve CPU Profiler memory usage and test coverage, and fixed quite a few memory leaks there.

I'm still a bit confused how this is a problem on V8 CPU Profiler but not on https://github.com/mmarchini/node-linux-perf, since we do the same thing on that module.

I didn't test the performance using that module yet, and assumed that it wouldn't be a problem since my understanding is that the mapping (or at least the heaviest part of the mapping) only happens once, and then the map file doesn't really move for the rest of the process lifetime.

When you start it, it will iterate over every function in memory and write it to a file. After that it will just write relocations and new compilations triggered by V8. FWIW we didn't notice overhead with that module at Netflix so far. Also, you don't want to keep node-linux-perf running indefinitely, as that would be the same as starting the process with --perf-basic-prof, and you'd end up with an insanely large .map file, potentially causing performance problems or filling up the disk.
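For reference, using that module for a bounded window looks roughly like this (a sketch assuming the linux-perf npm package and its start()/stop() exports):

// sketch assuming the `linux-perf` package (https://github.com/mmarchini/node-linux-perf)
const linuxPerf = require('linux-perf');

// start emitting JIT symbol information (the /tmp/perf-<pid>.map file that perf(1) reads)
linuxPerf.start();

// ...run `perf record -F 99 -g -p <pid> -- sleep 30` from another shell during this window...

// stop after the profiling window so the map file doesn't keep growing
setTimeout(() => linuxPerf.stop(), 60 * 1000);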


Maybe doing the copying of existing functions only once would fix the issue completely? Something like CpuProfiler::prepare or otherwise just adding a way to get the profile without stopping.

V8 is always moving functions around and generating new code objects, so even though this is possible it would only cover a part of the problem.

I think I made a mistake in my comment above: CpuProfiler::prepare makes sense and it would work similarly to node-linux-perf. I still think the overhead is fixable on the V8 side though (and more worth the time investment).

There is a v8 issue regarding this: https://bugs.chromium.org/p/v8/issues/detail?id=7719

Sorry, I can't remember the versions. But it was after the major leaks were fixed and other improvements were backported.
I just did some fast tests on my Windows PC and my reproducer from the linked issue shows startProfiling takes >1 second. I know that sample is not showing a real app, but it was based on a real app that caused crashes at the time.

@mmarchini I think you explained the root cause quite well. The overhead is because CpuProfiler::StartProfiling has to iterate over all functions and copy a small amount of info for each of them into the profiler thread. During a running session, updates are sent whenever a function is created/moved/..., but these are just single events, none of which blocks for a significant time. node-linux-perf has the startup overhead only once.

For small apps CpuProfiler::StartProfiling is also fine, but if there are a lot of function objects it's a lot of overhead.

Most likely a prepare API would help, but it could end up in a memory leak as function infos never get removed. I'm not sure if this plays well with several profiling sessions running in parallel (e.g. different tools used at the same time). Maybe one prepare per tool is needed.

I ran the example on a few Node.js versions:

8.0.0:

function creation done
RSS: 158.46484375, heapTotal: 133.16796875
RSS: 158.46484375, heapTotal: 133.16796875
RSS: 158.46484375, heapTotal: 133.16796875
RSS: 158.46484375, heapTotal: 133.16796875
restart profiling
RSS: 158.66015625, heapTotal: 133.16796875
RSS: 158.66015625, heapTotal: 133.16796875
RSS: 158.66015625, heapTotal: 133.16796875
RSS: 158.66015625, heapTotal: 133.16796875
RSS: 158.66015625, heapTotal: 133.16796875
restart profiling
RSS: 159.80859375, heapTotal: 133.16796875
RSS: 160.00390625, heapTotal: 133.16796875
RSS: 160.00390625, heapTotal: 133.16796875
gc unused functions...
RSS: 34.65234375, heapTotal: 7.4921875
RSS: 34.65234375, heapTotal: 7.4921875
restart profiling
RSS: 35.7734375, heapTotal: 7.4921875
RSS: 35.7734375, heapTotal: 7.4921875
RSS: 35.7734375, heapTotal: 7.4921875
RSS: 35.7734375, heapTotal: 7.4921875
RSS: 35.7734375, heapTotal: 7.4921875
restart profiling
RSS: 35.7734375, heapTotal: 7.4921875
RSS: 35.7734375, heapTotal: 7.4921875
RSS: 35.7734375, heapTotal: 7.4921875
RSS: 35.7734375, heapTotal: 7.4921875
RSS: 33.875, heapTotal: 6.4921875
restart profiling

8.17.0:

function creation done
RSS: 149.13671875, heapTotal: 122.93359375
RSS: 149.13671875, heapTotal: 122.93359375
RSS: 149.13671875, heapTotal: 122.93359375
restart profiling
RSS: 398.75, heapTotal: 122.93359375
RSS: 398.75, heapTotal: 122.93359375
RSS: 398.75, heapTotal: 122.93359375
RSS: 398.75, heapTotal: 122.93359375
RSS: 398.75, heapTotal: 122.93359375
restart profiling
RSS: 413.69921875, heapTotal: 122.93359375
RSS: 413.69921875, heapTotal: 122.93359375
RSS: 413.69921875, heapTotal: 122.93359375
gc unused functions...
RSS: 332.1953125, heapTotal: 9.328125
RSS: 301.21484375, heapTotal: 9.328125
restart profiling
RSS: 301.21484375, heapTotal: 9.328125
RSS: 301.21484375, heapTotal: 9.328125
RSS: 301.21484375, heapTotal: 9.328125
RSS: 301.21484375, heapTotal: 9.328125
RSS: 301.21484375, heapTotal: 9.328125
restart profiling
RSS: 301.21484375, heapTotal: 9.328125
RSS: 301.21484375, heapTotal: 9.328125
RSS: 301.21484375, heapTotal: 9.328125
RSS: 301.21484375, heapTotal: 9.328125
RSS: 301.21484375, heapTotal: 9.328125

10.0.0:

function creation done
RSS: 148.421875, heapTotal: 115.91015625
RSS: 148.421875, heapTotal: 115.91015625
RSS: 148.421875, heapTotal: 115.91015625
RSS: 148.421875, heapTotal: 115.91015625
restart profiling
RSS: 1107.91796875, heapTotal: 115.91015625
RSS: 1107.91796875, heapTotal: 115.91015625
RSS: 1107.91796875, heapTotal: 115.91015625
RSS: 1107.91796875, heapTotal: 115.91015625
restart profiling
RSS: 1122.609375, heapTotal: 115.91015625
RSS: 1122.609375, heapTotal: 115.91015625
gc unused functions...
RSS: 1048.05078125, heapTotal: 9.8046875
RSS: 1048.05078125, heapTotal: 9.8046875
restart profiling
RSS: 1048.05078125, heapTotal: 9.8046875
RSS: 1048.05078125, heapTotal: 9.8046875
RSS: 1048.05078125, heapTotal: 9.8046875
RSS: 1048.05078125, heapTotal: 9.8046875
RSS: 1048.05078125, heapTotal: 9.8046875
restart profiling
RSS: 1048.05078125, heapTotal: 9.8046875
RSS: 1048.05078125, heapTotal: 9.8046875
RSS: 1048.05078125, heapTotal: 9.8046875
RSS: 1048.05078125, heapTotal: 9.8046875
RSS: 1048.05078125, heapTotal: 9.8046875
restart profiling
RSS: 1046.18359375, heapTotal: 8.8046875

10.19.0:

function creation done
RSS: 145.64453125, heapTotal: 115.33984375
RSS: 145.64453125, heapTotal: 115.33984375
RSS: 145.64453125, heapTotal: 115.33984375
RSS: 145.64453125, heapTotal: 115.33984375
restart profiling
RSS: 387.4609375, heapTotal: 115.33984375
RSS: 387.4609375, heapTotal: 115.33984375
RSS: 387.4609375, heapTotal: 115.33984375
RSS: 387.4609375, heapTotal: 115.33984375
RSS: 387.4609375, heapTotal: 115.33984375
restart profiling
RSS: 402.15234375, heapTotal: 115.33984375
RSS: 402.15234375, heapTotal: 115.33984375
RSS: 402.15234375, heapTotal: 115.33984375
gc unused functions...
RSS: 297.5078125, heapTotal: 10.234375
RSS: 297.5078125, heapTotal: 10.234375
restart profiling
RSS: 297.5078125, heapTotal: 10.234375
RSS: 297.5078125, heapTotal: 10.234375
RSS: 297.5078125, heapTotal: 10.234375
RSS: 297.5078125, heapTotal: 10.234375
RSS: 297.5078125, heapTotal: 10.234375
restart profiling
RSS: 297.5078125, heapTotal: 10.234375
RSS: 297.5078125, heapTotal: 10.234375
RSS: 297.5078125, heapTotal: 10.234375
RSS: 297.5078125, heapTotal: 10.234375
RSS: 297.5078125, heapTotal: 10.234375
restart profiling

12.0.0:

function creation done
RSS: 143.7265625, heapTotal: 113.9140625
RSS: 143.7265625, heapTotal: 113.9140625
RSS: 143.7265625, heapTotal: 113.9140625
RSS: 143.7265625, heapTotal: 113.9140625
restart profiling
RSS: 407.86328125, heapTotal: 113.9140625
RSS: 407.86328125, heapTotal: 113.9140625
RSS: 407.86328125, heapTotal: 113.9140625
RSS: 407.86328125, heapTotal: 113.9140625
RSS: 407.86328125, heapTotal: 113.9140625
restart profiling
RSS: 422.8125, heapTotal: 113.9140625
RSS: 422.8125, heapTotal: 113.9140625
RSS: 422.8125, heapTotal: 113.9140625
gc unused functions...
RSS: 319.97265625, heapTotal: 8.32421875
RSS: 319.97265625, heapTotal: 8.32421875
restart profiling
RSS: 319.97265625, heapTotal: 8.32421875
RSS: 319.97265625, heapTotal: 8.32421875
RSS: 319.97265625, heapTotal: 8.32421875
RSS: 319.97265625, heapTotal: 8.32421875
RSS: 319.97265625, heapTotal: 8.32421875
restart profiling
RSS: 319.97265625, heapTotal: 8.32421875
RSS: 315.44921875, heapTotal: 5.82421875
RSS: 315.39453125, heapTotal: 6.32421875
RSS: 315.39453125, heapTotal: 6.32421875
RSS: 315.39453125, heapTotal: 6.32421875
restart profiling

12.16.1:

function creation done
RSS: 147.42578125, heapTotal: 110.3203125
RSS: 147.42578125, heapTotal: 110.3203125
RSS: 147.42578125, heapTotal: 110.3203125
RSS: 147.42578125, heapTotal: 110.3203125
restart profiling
RSS: 169.37109375, heapTotal: 110.3203125
RSS: 169.37109375, heapTotal: 110.3203125
RSS: 169.37109375, heapTotal: 110.3203125
RSS: 169.37109375, heapTotal: 110.3203125
RSS: 169.37109375, heapTotal: 110.3203125
restart profiling
RSS: 184.3203125, heapTotal: 110.3203125
RSS: 184.3203125, heapTotal: 110.3203125
RSS: 184.3203125, heapTotal: 110.3203125
gc unused functions...
RSS: 79.98046875, heapTotal: 5.98046875
RSS: 79.98046875, heapTotal: 5.98046875
restart profiling
RSS: 79.98046875, heapTotal: 5.98046875
RSS: 79.98046875, heapTotal: 5.98046875
RSS: 79.98046875, heapTotal: 5.98046875
RSS: 79.98046875, heapTotal: 5.98046875
RSS: 79.98046875, heapTotal: 5.98046875
restart profiling
RSS: 79.98046875, heapTotal: 5.98046875
RSS: 77.29296875, heapTotal: 4.48046875
RSS: 77.234375, heapTotal: 4.48046875
RSS: 77.484375, heapTotal: 4.734375
RSS: 77.484375, heapTotal: 4.734375
restart profiling

So it seems we regressed at some point, but those regressions were fixed. Memory-wise it looks like we're good on the latest v12. If you're still seeing the issue on Windows, maybe this was not fixed on all platforms though.

I tried autocannon with a basic fastify server while enabling the profiler mid-run, and the results were:

{
  "requests": {
    "valid": false,
    "difference": "-0.99%",
    "pValue": 0,
    "significant": "***"
  },
  "throughput": {
    "valid": true,
    "difference": "-1%",
    "pValue": 0.35704004147139157,
    "significant": ""
  },
  "latency": {
    "valid": true,
    "difference": "0%",
    "pValue": 1,
    "significant": ""
  },
  "aWins": false,
  "bWins": false,
  "equal": true
}

a is a normal run and b is with profiler started mid-run.

So looking at it, the average is not affected that much. The max latency bumped from 4.77ms to 17.04ms though. How many requests were affected? It's hard to say, since autocannon only provides aggregated data, but considering that b served more requests than a, I don't believe it affected many requests.

For completeness I used bpftrace to measure how long StartProfiling took, and it takes ~10ms.

I also tried the same but instead of using the profiler I used linux-perf (just the package, I didn't profile with perf):

{
  "requests": {
    "valid": false,
    "difference": "-0.58%",
    "pValue": 0,
    "significant": "***"
  },
  "throughput": {
    "valid": true,
    "difference": "-0.56%",
    "pValue": 0.5709278955071249,
    "significant": ""
  },
  "latency": {
    "valid": true,
    "difference": "0%",
    "pValue": 1,
    "significant": ""
  },
  "aWins": false,
  "bWins": false,
  "equal": true
}

A little bit less of a difference. Max latency this time was 4.82ms vs 9.74ms, so linux-perf is still faster, which still surprises me. Maybe on a slow EC2 disk the impact would be different. bpftrace shows the synchronous part of linux-perf takes 1513 nanoseconds.

IMO the V8 CPU Profiler doesn't look so bad in this little experiment, but I totally understand there might be use cases where 10ms overhead is unacceptable even for a single request. In those cases the V8 CPU Profiler is not the right tool. But for most Node.js users this seems pretty reasonable.

With that being said, I don't like micro benchmarks like this 😅. They can often be misleading: in real-world applications such a small variance can become unnoticeable because of other factors like network, number of concurrent async operations, other instrumentation, etc. It can also be misleading the other way: a fastify microbenchmark doesn't have a lot of functions in memory, so a huge application might behave differently. I'll try to get some more realistic data using the tools I used here.

P.S.: For the first example, I didn't use the C++ API nor the inspector API. Both have unwanted JS<>C++ boundary overhead, which is unnecessary. Instead I connected to the inspector protocol directly from an external process, which is how we should be advocating for folks to use the inspector protocol in production (I used websocat, but a separate Node.js process with ws also works).

I'm quite sure the key to seeing a difference is the number of functions. You could add something like

let a = [];
for (let i = 0; i < 1000000; i++) {
  const f = new Function("a", "b", "return a + b");
  f(i, i - 1);
  a.push(f);
}

in server and retest.

I tried this based on https://github.com/fastify/benchmarks/blob/master/benchmarks/fastify.js with autocannon, but on my system I always get a quite high value for maxLatency (even without the above code and without enabling profiling). Therefore I don't trust this setup that much... but my notebook is for sure not a production-like setup.

It also seems that the inspector API is blocking. I tried

console.time("start")
session.post('Profiler.start', () => {});
console.timeEnd("start")

and this takes >1 second if there are 1,000,000 functions, whereas it takes just a few ms if there are 100,000.

P.S.: For the first example, I didn't use the C++ API nor the inspector API. Both have unwanted JS<>C++ boundary overhead, which is unecessary. Instead I connected to the inspector protocol directly from an external process, which is how we should be advocating for folks to use the inspector protocol in production (I used websocat, but a separate Node.js process with ws also works).

The JS => C++ boundary should not be a problem here. The requirement to run a sidecar process to do profiling looks quite cumbersome to me. I think it should more or less not matter whether I interact with the profiler from within the process or from outside.

If I understand https://github.com/WICG/js-self-profiling correctly, work is ongoing to provide profiling via a standardized API.

I'm quite sure the key to see a diff is the number of functions

Agreed, which is why I want to see it in production. Looking at your example though, how realistic is it for an application to have one million functions? With linux-perf and --perf-basic-prof you can get a general idea of how many functions an application has (I'm pretty sure the ones I've seen so far have fewer than a hundred thousand functions, but I'll double-check).

It seems also that inspector API is blocking

It is, the API is kinda weird.

The requirement to run a sidecar process to do profiling looks quite cumbersum to me

There is no sidecar process. It's a one-off process you call (just like calling ps to see which processes are running, or top). If triggering it via an API is desirable, using a worker thread to make calls to the main thread's inspector protocol is recommended (a sketch follows below). If having a worker is still cumbersome, that's something we might be able to fix in core.
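A minimal sketch of that worker-thread approach (assuming Node ≥ 12.11, where an inspector session created inside a worker can attach to the main thread via connectToMainThread()):

// profiler-worker.js -- start from the main thread with: new Worker('./profiler-worker.js')
const inspector = require('inspector');
const { parentPort } = require('worker_threads');

const session = new inspector.Session();
// attach this session to the *main* thread's isolate (only valid inside a worker thread)
session.connectToMainThread();

function takeProfile(ms) {
  session.post('Profiler.enable', () => {
    session.post('Profiler.start', () => {
      setTimeout(() => {
        session.post('Profiler.stop', (err, result) => {
          if (!err) parentPort.postMessage(result.profile);
        });
      }, ms);
    });
  });
}

// e.g. take a 10-second profile of the main thread every minute
takeProfile(10000);
setInterval(() => takeProfile(10000), 60 * 1000);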

If I understand https://github.com/WICG/js-self-profiling correct work is ongoing to the profiling via standardized API.

I forgot that existed and I'm happy to see it moving forward. I guess we could provide that API in our perf_hooks module, but I don't think it will address any performance concerns, nor will it cover the native frames use case (although not solving the former is probably fine).

I guess this could be relevant to the discussion: profiling Java with Linux perf has a similar "startup cost" for the same reason: it iterates over everything to find all functions (but since Java applications are a lot bigger and can easily use 64 GB+ of memory, the time to iterate over everything is longer).

I'm not sure how their integrated profiling performs though.

Instead I connected to the inspector protocol directly from an external process, which is how we should be advocating for folks to use the inspector protocol in production

I did my tests from the same process, so it's possible the profiler is doing something with the captured data which was impacting the main thread. I'll try connecting externally as you did and report back.

The requirement to run a sidecar process to do profiling looks quite cumbersum to me. I think it should be more or less don't care if I interact with profiler within a process or from outside.

I definitely agree with this, especially nowadays where you are often limited to a single core per container, in which case doing work "externally" will still happen on the same core, although in a non-blocking manner (which could also be done with threads in V8 so I'm not 100% sure I understand why it would make a difference).

There is no sidecar process. It's a one-off process you call

It is a sidecar process if you want to continuously take profiles, say every 10 seconds, which is my use case. Not saying it's not a good option though, but it shouldn't be mandatory to get good performance.

I guess we could provide that API on our perf_hooks module, but I don't think it will address any performance concerns nor it will cover the native frames use case (although not solving the former is probably fine).

Yes, these are definitely 2 different issues. IMO both are important but unrelated.

If I understand https://github.com/WICG/js-self-profiling correct work is ongoing to the profiling via standardized API.

Is there any plan for this to be exposed by Node?

One thing to note is that all examples in the Node documentation are connecting to the inspector session from the same process. This should probably be updated if not recommended.

Is there any plan for this to be exposed by Node?

I suppose that the underlying implementation inside Chrome would use the same profiler as Node, just with specific security implications addressed? I tried to check inside Chromium but couldn't find any C++ code that would answer that.

I suppose that the underlying implementation inside Chrome would use the same profiler as node, just with specific security implications addressed ?

Ah yes very possible. I assumed it would be a new implementation that would address the shortcomings of the current implementation.

Just tried connecting from a different process and I get the same result: max latency is around 15-30ms without any Profiler.start command, and jumps to >100-200ms with calls to Profiler.start on a simple "Hello World!" Express app. I used autocannon with the default settings on both Node 12.16.1 and Node 13.12.0 with the same result.

Looking at your example though, how realistic is it for an application to have one million functions?

Millions is a bit of an exaggeration I agree, but there are definitely use cases for thousands of dynamic functions being created in quick succession over the lifetime of the process.

It looks like 10,000 functions instead of 1 million makes no difference compared to just a few functions, so 1 million is probably past some kind of threshold. I would say then that this example is probably not an issue.

I definitely agree with this, especially nowadays where you are often limited to a single core per container

What I've heard from former Chrome DevTools developers is that the current version of V8 expects two cores to perform well, since GC, parsing, compilation, etc. all happen on separate threads now (not to mention the threads spawned by Node.js). So folks running on a single core might already be taking a good performance hit.

It is a sidecar process if you want to continuously take profiles, say every 10 seconds, which is my use case.

IMO, based on how things are implemented today, this should be done on a worker thread.

One thing to note is that all examples in the Node documentation are connecting to the inspector session from the same process. This should probably be updated if not recommended.

Agreed.

Just tried connecting from a different process and I get the same result max latency is around 15-30ms without any Profiler.start command, and jumps to >100-200ms with calls to Profiler.start on a simple "Hello World!" Express app

Apparently we ran similar tests but got very different results. I'll try with express instead of fastify to see if there are any changes. Which hardware did you use to run these tests? What parameters on autocannon?

hello.js
const inspector = require('inspector')
const express = require('express')
const app = express()

inspector.open(3333, '127.0.0.1', false)

app.get('/hello', (req, res) => {
  res.send('Hello World!')
})

app.listen(8080)
profile.js
const WebSocket = require('isomorphic-ws')
const ws = new WebSocket(process.env.WS)

ws.onmessage = ({ data }) => {
  // console.log(data) // only used to validate everything works
}

ws.onopen = () => {
  ws.send(JSON.stringify({
    id: 1,
    method: 'Profiler.enable'
  }))

  setInterval(() => {
    ws.send(JSON.stringify({
      id: 1,
      method: 'Profiler.start'
    }))

    ws.send(JSON.stringify({
      id: 1,
      method: 'Profiler.stop'
    }))
  }, 5000)
}
  • System: MacBook Pro 13-inch 2017, macOS 10.15.2
  • Node: 12.16.1 and 13.12.0
  • Autocannon: autocannon http://localhost:8080/hello (CLI)

Out of curiosity, why do you recommend using a worker thread for this? So far my tests showed no difference between using a worker thread or not. This is especially true if the underlying implementation is already using a thread.

Looking at your example though, how realistic is it for an application to have one million functions?

The problem I reported in #23070 was based on a real crash with a real application from a customer. That application had ~230,000 functions. I increased it to 1M in my reproducer to get a memory pattern similar to the real application, which already requires a few hundred MB without the profiler.
In that application quite a few functions were GCed after the app's startup initialization was done.

At the time, the memory overhead was the main problem, and that looks far better now. But it still seems that the amount of work to be done by the V8 main thread depends on the number of functions, and it is synchronous – most likely because functions could otherwise move in memory.

I agree that having that many functions is not the typical setup. The main issue for us was that there is no mechanism to estimate or limit the overhead, so we stopped using the profiler.
Our use case is to continuously profile at a low interval to get a long-term pattern, not to profile/optimize something or analyse an app in a bad state.

Those are valid use cases, but as I mentioned above, most of the improvements should be implemented in V8. So I guess we need to start discussing how to get this upstream.

Added to the agenda so we can discuss how to proceed.

Removed from the agenda until someone has time to work on it (reach out to V8 folks, gather data, etc.)

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

I would like to discuss it and help. Re-adding to the agenda

Hi folks.

As previously discussed in WG meetings, a lot of work has been done since this issue was created in 2018, and a lot of context has been collected here.

It seems fair to close this issue and create another one about the performance of Profiler.start() and Profiler.stop().
We are also working on exposing the Deep Dive documents through the documentations/ folder. You can see it here.

Closing in favor of #444