nodejs/diagnostics

Performance profiles for typical Node.js applications

joyeecheung opened this issue · 24 comments

During the breakout session on day 1 of this week's diagnostics summit, we discussed providing performance profiles for typical Node.js applications (e.g. CPU and memory usage, GC overhead, throughput, latency, etc.) so that users can form a clear mental picture of how their applications perform and whether there is room for optimization.

This seems to overlap somewhat with the work of the benchmarking working group, since they are looking for real-world workloads as well. In our case, though, we are not looking for an open code base; rather, we need some average statistics about typical Node.js applications.

Agree it is useful for end-users to be able to reason about perf characteristics. Challenge with what you're suggesting is going to be defining what is a "typical" application, and then what is a "typical" host environment. E.g., running on high-end dedicated hw is going to produce one set of values for throughput, running same app on a public cloud where you're randomly co-located on hw with "noisy neighbors" is going to produce an entirely different set of results.

A solid outcome (IMO, feel free to disagree :) is we have a set of applications that represent different workloads, and there are prescriptive instructions for how end users can consistently replicate and share the results (i.e., a fixed set of measures and fixed format report). This would enable more consistent comparisons, and enable comparisons across different host environments.

Agree this overlaps w/ benchmarking WG.

@mike-kaufman I'm more interested in having a sampling of real-world code than more "standard" benchmarks. The basic question that is often asked is "is my application as fast as it can be" (within reason)? Am I doing something wrong? I don't think having standard benchmarks would solve these scenarios.

Not sure I'm clear on the difference between "a sampling of real-world-code" and "a workload benchmark" (e.g. acme air). My thinking is you can define some canonical apps that represent certain workloads (e.g., a DB-backed CRUD app, a socket proxy app,...), and then these can be used to drive meaningful comparisons (host A vs host B, my CRUD app vs canonical CRUD app, node 9 vs node 10,...)

If you have an idea or some code that represents what you have in mind, that might help me better understand your point.

I think what we are looking for are:

  1. Categories of typical Node.js applications and their use cases, like "DB-backed CRUD app, a socket proxy app" that @mike-kaufman mentioned above - so when we put out a table, the users know where to look
  2. Their performance statistics e.g. latency, throughput, Apdex, with the latency of their external resources for reference - so users can adjust their expectations accordingly
  3. How those applications scale, either horizontally or vertically (where CPU & memory usage stats are useful), and what the relationship between resource usage and throughput/latency is - this helps users know when they are spending more than they should, and helps them plan allocations for e.g. cloud resources
  4. Availability of these applications: e.g. uptime/downtime, number of nines - so users know when their applications are not doing well, and how to adjust their SLAs
  5. Other factors: the framework/RPC protocol/etc. they use, the version of Node.js they are on, the cloud provider/host they are on, the monitoring solutions they use

The main difference between this and the benchmarking WG's effort is that we are looking for real-world, in-production data, without actually getting the code. We could therefore obtain those statistics using, say, surveys with appropriate questions. This looks a bit like the Foundation's user survey questions, but goes into more detail and asks for numbers.

Another way to get this data is to ask the APM vendors to help us collect it, with the users' consent. In return they get performance profiles to pattern match against, so they can alert users when their applications underperform, and even give advice based on the data.

Another way to use this data/report: there are many users coming to the nodejs/node issue tracker with all kinds of memory/CPU usage graphs from their APM providers, asking questions about the performance of their applications, but without actual code, those issues are not really actionable and often get closed due to inactivity. With this kind of data in place, we can redirect them there instead of giving some vague answers, and this helps us triage the actual performance issues.

> Another way to get this data is to ask the APM vendors to help us collect it, with the users' consent. In return they get performance profiles to pattern match against, so they can alert users when their applications underperform, and even give advice based on the data.

I would not rely on APMs here.
If APMs report performance problems (and not just metrics) they use baselining and anomaly detection for a given application. There are just too many factors to rely on a static performance profile to compare against.

I'd also rather opt for a standardized set of key metrics that can be collected from a running process and then sent to a third party for further inspection.

Even better would be:

  • Users can opt-in to performance reporting
  • Performance snapshots are sent to a central repository
  • Big data analysis reports anomalies or improvements

Doing so would allow us to source comparable real-world performance information for many different configurations. It would also be possible to integrate this with APM via events.
Of course, this would involve plenty of work and would also require hardware to run the benchmarking service on.
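
To make the opt-in idea above concrete, here is a minimal sketch of what a snapshot could look like before it is sent anywhere. It only uses APIs that ship with Node.js; the `NODE_PERF_REPORT` environment variable, the field names, and the function names are assumptions made for illustration, not an agreed format, and the transport to a central repository is deliberately left out.

```javascript
const os = require('node:os');

// Hypothetical opt-in switch; the thread does not define one.
function reportingEnabled(env = process.env) {
  return env.NODE_PERF_REPORT === '1';
}

// Build one snapshot from built-in process and OS information.
function buildSnapshot() {
  const mem = process.memoryUsage();
  return {
    timestamp: new Date().toISOString(),
    nodeVersion: process.version,
    platform: process.platform,
    arch: process.arch,
    cpuCount: os.cpus().length,
    uptimeSeconds: process.uptime(),
    rssBytes: mem.rss,
    heapUsedBytes: mem.heapUsed,
  };
}

// Returns the JSON payload to send, or null when the user has not opted in.
function maybeSerializeSnapshot(env = process.env) {
  return reportingEnabled(env) ? JSON.stringify(buildSnapshot()) : null;
}
```

Nothing leaves the process unless the switch is set, which is the property the opt-in bullet asks for.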

> If APMs report performance problems (and not just metrics) they use baselining and anomaly detection for a given application.

It's definitely hard to come up with an exact X for "any CRUD web service". At the same time a new service doesn't have a baseline but might be interested in how it stacks up. The thing that APM providers might know though is "what is the general profile of [a class of] node apps". E.g. what's the throughput/latency distribution/memory usage of web services that make a few database calls? Getting those numbers for one or two services will be a poor representation of "normal". Getting those numbers for hundreds or thousands of services and looking at distributions might be more telling.
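
As a rough illustration of "looking at distributions" rather than at one or two services, the sketch below summarizes one number per service with nearest-rank percentiles. The sample values are made up, and the function names are hypothetical.

```javascript
// Nearest-rank percentile over an array of numbers (p between 0 and 100).
function percentile(values, p) {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function summarize(values) {
  return {
    count: values.length,
    p50: percentile(values, 50),
    p90: percentile(values, 90),
    p99: percentile(values, 99),
  };
}

// Made-up median latencies in milliseconds, one entry per service.
const medianLatenciesMs = [12, 15, 9, 40, 22, 18, 11, 95, 14, 16];
console.log(summarize(medianLatenciesMs));
// → { count: 10, p50: 15, p90: 40, p99: 95 }
```

With only ten services the tail percentiles are dominated by a single outlier, which is the point made above: the numbers become telling only with hundreds or thousands of services.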

So, entirely possible I'm not understanding what's being proposed here. That said, I feel like the goal (having a set of KPIs that can be used to gauge perf of their app) lacks any controls. It's analogous to saying "water boils in n seconds" w/out controlling for initial temp of water, BTU output of stove, volume of water or altitude.

> I'd also rather opt for a standardized set of key metrics that can be collected from a running process and then sent to a third party for further inspection.

This is an interesting way to frame this, and aligns with the trace-macros efforts & the eliminating monkey patching efforts.

> That said, I feel like the goal (having a set of KPIs that can be used to gauge perf of their app) lacks any controls.

Fully agreed.

> E.g. what's the throughput/latency distribution/memory usage of web services that make a few database calls? Getting those numbers for one or two services will be a poor representation of "normal". Getting those numbers for hundreds or thousands of services and looking at distributions might be more telling.

Agreed - I can think of a combined metric that consists of

  • RSS, Heap usage
  • GC frequency
  • CPU usage
  • some event loop counters
  • RPS
  • Response time

Plus metadata like Node version, etc.
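
A minimal sketch of how the combined metric listed above could be gathered from inside a process, assuming only built-in modules. `process.memoryUsage()`, `process.cpuUsage()`, `performance.eventLoopUtilization()` and the `'gc'` performance entry type are real Node.js APIs; `recordRequest` is a hypothetical hook that the application would have to call itself, and the field names are illustrative.

```javascript
const { performance, PerformanceObserver } = require('node:perf_hooks');

// Count GC events as perf_hooks reports them.
let gcCount = 0;
const gcObserver = new PerformanceObserver((list) => {
  gcCount += list.getEntries().length;
});
gcObserver.observe({ entryTypes: ['gc'] });

// Request counters fed by the application (hypothetical hook).
let requestCount = 0;
let totalResponseMs = 0;
function recordRequest(durationMs) {
  requestCount += 1;
  totalResponseMs += durationMs;
}

const startedAt = performance.now();

function collectCombinedMetric() {
  const mem = process.memoryUsage();
  const cpu = process.cpuUsage(); // microseconds of CPU time used by the process
  const elapsedSeconds = (performance.now() - startedAt) / 1000;
  return {
    nodeVersion: process.version,
    rssBytes: mem.rss,
    heapUsedBytes: mem.heapUsed,
    heapTotalBytes: mem.heapTotal,
    gcCount,
    cpuUserMicros: cpu.user,
    cpuSystemMicros: cpu.system,
    eventLoopUtilization: performance.eventLoopUtilization().utilization,
    requestsPerSecond: elapsedSeconds > 0 ? requestCount / elapsedSeconds : 0,
    meanResponseMs: requestCount > 0 ? totalResponseMs / requestCount : 0,
  };
}
```

Call `gcObserver.disconnect()` when collection is no longer needed. A mean response time hides the tail, so a real collector would more likely keep a histogram.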

If we are able to collect that for a range of servers, regressions between versions would be detectable.

> I feel like the goal (having a set of KPIs that can be used to gauge perf of their app) lacks any controls. It's analogous to saying "water boils in n seconds" w/out controlling for initial temp of water, BTU output of stove, volume of water or altitude.

Yes, I agree. We need additional input for measurement. What I have in mind as the end result is something like a "performance calculator", or "cost calculator" - the users can give us:

  • What kind of work the application does
  • The version of Node they are using
  • For each request/unit of work: how many requests to external systems/DBs will be sent, what the expected latency of those is, and how likely they are to be triggered (e.g. if there is an external cache). There could be a vector of these request types because, for example, a request that involves a lot of writes to the DB has a different performance profile from one that involves mostly reads that are likely to hit the cache.
  • Other factors that we can come up with that are meaningful

We can give them in return:

  • Expected average/percentile of metrics, e.g. those mentioned in #161 (comment)

So, ideally, users can find out whether their existing applications are underperforming or costing more than they should by providing those inputs, and can get a better idea of the kind of cloud resources/hosts they need before they bring something into production.
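
To illustrate the shape of the input described above, here is a small sketch that models the "vector of request types" and computes only the simplest possible output: the expected time per request spent waiting on external systems, assuming the calls run sequentially. The object layout, names, and numbers are all hypothetical. A real calculator would need empirical data from many applications, which is exactly what this thread is trying to figure out how to source.

```javascript
// Each external call: how many per request, expected latency, probability of being triggered.
function expectedExternalLatencyMs(calls) {
  return calls.reduce((sum, c) => sum + c.count * c.probability * c.latencyMs, 0);
}

// Hypothetical calculator input with made-up numbers.
const input = {
  workload: 'DB-backed CRUD app',
  nodeVersion: 'v10.x',
  requestTypes: [
    {
      name: 'read',
      calls: [
        { target: 'cache', count: 1, latencyMs: 2, probability: 1 },
        { target: 'db', count: 1, latencyMs: 20, probability: 0.25 }, // only on a cache miss
      ],
    },
    {
      name: 'write',
      calls: [{ target: 'db', count: 2, latencyMs: 20, probability: 1 }],
    },
  ],
};

const estimates = input.requestTypes.map((t) => ({
  name: t.name,
  expectedExternalLatencyMs: expectedExternalLatencyMs(t.calls),
}));
console.log(estimates);
// → [ { name: 'read', expectedExternalLatencyMs: 7 },
//     { name: 'write', expectedExternalLatencyMs: 40 } ]
```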

I think our first goal is to figure out what kind of input we need, and what kind of output we can give in return.

> We can give them in return: expected average/percentile of metrics, e.g. those mentioned in #161 (comment)

That sounds good but I wonder how to source the raw data to provide such insights.
From the data I have, I could not provide that to our customers without having some kind of cross reference to expected performance measures.
That's why I suggest tackling this by providing a way to collect metrics from applications.

I'm +1 on the idea of having a standard set of "runtime perf metrics" that can come out of node. I think this is the first step to what's being suggested, and overlaps nicely with the goal of having more runtime diagnostic info output through trace macros.

If the scope here is the standard metrics, there's overlap with #131. Would be nice if we can come to some agreement on this & close/consolidate issues as appropriate.

RE Joyee's suggestion about the "cost calculator", I love the idea of it in the abstract. My worry is there are just too many variables at play to land this successfully. Ideally, we want to be able to surface easy-to-understand perf data in easy-to-use tooling so that "mere mortal" developers can get a sense of how their app is performing & what options they have to improve it.

Having a set of "runtime perf metrics" that are easy to extract and process would be fantastic. It's a great starting point.

This can also live in the community/ecosystem, something that you either enable with -r simplemetrics or that you can query live for a status endpoint or similar.
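
As a sketch of what such an ecosystem module might look like: `simplemetrics` is not an existing package that I know of, so the file below is a hypothetical `simplemetrics.js` that could be preloaded with `node -r ./simplemetrics.js app.js`. The `SIMPLEMETRICS_PORT` environment variable and the `/status` path are assumptions for illustration.

```javascript
const http = require('node:http');

// Collect a few process-level values from built-in APIs.
function statusPayload() {
  const mem = process.memoryUsage();
  return {
    nodeVersion: process.version,
    uptimeSeconds: process.uptime(),
    rssBytes: mem.rss,
    heapUsedBytes: mem.heapUsed,
    cpuMicros: process.cpuUsage(),
  };
}

// Serve the payload as JSON on /status, bound to localhost only.
function startStatusEndpoint(port) {
  const server = http.createServer((req, res) => {
    if (req.url !== '/status') {
      res.writeHead(404);
      res.end();
      return;
    }
    res.writeHead(200, { 'content-type': 'application/json' });
    res.end(JSON.stringify(statusPayload()));
  });
  server.listen(port, '127.0.0.1');
  server.unref(); // do not keep the application alive just for metrics
  return server;
}

// When preloaded, start only if a port has been configured.
if (process.env.SIMPLEMETRICS_PORT) {
  startStatusEndpoint(Number(process.env.SIMPLEMETRICS_PORT));
}

module.exports = { statusPayload, startStatusEndpoint };
```

With the port set, `curl http://127.0.0.1:<port>/status` would return the current values without any change to the application code.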

> something that you either enable with -r simplemetrics or that you can query live for a status endpoint or similar.

I think this has some overlap with the tracing library. E.g., for ETW or Dtrace, you would just need to attach a listener and you'd get the events - you wouldn't need a special build or special switches at startup. Challenge here, I think, is a model that works across different trace libraries. See table here for some more detail.

Bumping this since some things have changed since early 2018, especially with the launch of OpenTelemetry, which works as a "standard" way to gather metrics/traces from an application and send them to any backend.
Even if it's hard to tell the user what a good or bad value is for a given metric, I suppose a good first step would be to make the metrics available in the first place, and I believe OpenTelemetry would be a good place to start this implementation.

What do you think?

I agree that working in the context of OpenTelemetry makes sense. I've done some work with OpenCensus and I think it's a good starting point for us when talking about how to deliver base metrics.

@hekike FYI as well, since I know you have contributed some changes to OpenCensus.

Yes, at the moment we are experimenting with using OpenCensus at Netflix for our Node platform. I would be interested in such a runtime metrics exporter, especially now that OpenTelemetry is happening. I could imagine an official solution provided by Node core to instrument core modules like HTTP and extract performance metrics, instead of OpenCensus's current monkey-patching approach. (Low-cost, stable, built-in context propagation would also be awesome.)
cc @mayurkale22

About the original question, I can ask around about what data from our use cases we can share with the community. We run various API workloads with technologies like GraphQL, gRPC, React rendering, etc. at both Netflix scale and smaller internal / partner scale.

I think we should distinguish here between metrics and tracing.
My understanding is that metrics here are timeseries values like memory consumption, CPU load, GC time, event loop timings, and so on, whereas traces are based on transactions linked together. Adding something like HTTP response times as a timeseries is often not that useful, because different requests have different values and mixing them hides that.

Adding metrics as described above to Node core should be quite easy, as reporting one metric has no side effects on the others and there is no need to have "world wide" context passing in place.
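
To show why such transaction-independent metrics need no context passing, here is a minimal sketch that samples them into a timeseries using only built-in `perf_hooks` APIs. The point layout and the function names are assumptions for illustration.

```javascript
const { monitorEventLoopDelay, PerformanceObserver } = require('node:perf_hooks');

// Event loop delay histogram; values are reported in nanoseconds.
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

// Accumulate time spent in GC, as reported by 'gc' performance entries.
let gcTimeMs = 0;
const gcObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) gcTimeMs += entry.duration;
});
gcObserver.observe({ entryTypes: ['gc'] });

const series = [];

// Append one timeseries point; no request or transaction context is involved.
function samplePoint() {
  const point = {
    time: Date.now(),
    heapUsedBytes: process.memoryUsage().heapUsed,
    cpuUserMicros: process.cpuUsage().user,
    gcTimeMs,
    eventLoopDelayMeanMs: loopDelay.mean / 1e6, // NaN until the histogram has samples
  };
  series.push(point);
  return point;
}

function startSampling(intervalMs) {
  const timer = setInterval(samplePoint, intervalMs);
  timer.unref(); // sampling alone should not keep the process running
  return timer;
}

function stopSampling(timer) {
  clearInterval(timer);
  loopDelay.disable();
  gcObserver.disconnect();
}
```

Each value is read independently of the others, which matches the argument above that one metric has no side effects on another.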

Traces/Transactions are a lot harder as they require working context passing which in turn depends on user modules.

@Flarna the context here is that OpenTelemetry can extract both timeseries and tracing data from a process and report them to various backends. I slightly disagree with your statement that timeseries data is always transaction-independent. For example, for downstream RPCs, it can be useful to break down metrics by downstream/upstream dependencies.

Here are some examples:

So 👍 on separation, but I do expect in the future these topics will bleed into each other.
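
One way to picture a metric broken down by dependency is a timer keyed by a label, as in the sketch below. This is not OpenTelemetry's or OpenCensus's API; the class, method names, and labels are hypothetical and only illustrate the idea of tagging a timeseries value with the dependency it belongs to.

```javascript
// Aggregates call durations per dependency label.
class LabeledTimer {
  constructor() {
    this.byLabel = new Map();
  }

  record(label, durationMs) {
    const stats = this.byLabel.get(label) ?? { count: 0, totalMs: 0, maxMs: 0 };
    stats.count += 1;
    stats.totalMs += durationMs;
    stats.maxMs = Math.max(stats.maxMs, durationMs);
    this.byLabel.set(label, stats);
  }

  summary() {
    const out = {};
    for (const [label, s] of this.byLabel) {
      out[label] = { count: s.count, meanMs: s.totalMs / s.count, maxMs: s.maxMs };
    }
    return out;
  }
}

// Made-up example: two downstream dependencies.
const rpcTimer = new LabeledTimer();
rpcTimer.record('user-service', 10);
rpcTimer.record('user-service', 30);
rpcTimer.record('billing-db', 5);
console.log(rpcTimer.summary());
// → { 'user-service': { count: 2, meanMs: 20, maxMs: 30 },
//     'billing-db': { count: 1, meanMs: 5, maxMs: 5 } }
```

Knowing which label applies at the moment of recording is where metrics start to depend on request context, which is how the two topics bleed into each other.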

Fully agree here.

My main point was that there are a lot of interesting, transaction-independent metrics out there which could be reported right now, without waiting for the long-running topic of context passing in Node.js to be closed (which may never happen at all).

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
