nodejs/diagnostics

Diagnostics "Best Practices" Guide?

mike-kaufman opened this issue · 33 comments

Something that would be interesting & valuable to the community is if the WG could come up with a set of diagnostics best practices/techniques for production applications. I'm a little concerned that this is a bit too ambitious, and a little concerned that one person's "best practice" is another's "really dumb thing to do". But I'm willing to throw this out to see what comes out of it. :)

I have a few thoughts about how this could be structured, but would love to hear if anyone else has ideas/suggestions first.

It would be great if we can come up with even just a guide with links to good resources and advertise it so people know what to read when they run into different types of problems in production. We always receive bug reports with obscure screenshots of resource usage in the core repo; documentation like that would be a good place to redirect them to, instead of nodejs/help.

It would be great if we can put this together

into different types of problems in production. We always receive bug reports with obscure screenshots of resource usages in the core repo,

@joyeecheung, can you point me to some examples here?

@mike-kaufman There should be a lot of hits if you search for the issues labeled memory: https://github.com/nodejs/node/issues?utf8=%E2%9C%93&q=is%3Aissue+label%3Amemory+

I love this idea, support it, and am willing to participate / contribute in whatever ways this group needs and I can.

While I agree that best practices can be subjective, there are elements of diagnostic steps that are impersonal and to the point - an example would be collecting and analysing heap dumps for a memory leak.
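To make the heap dump example concrete, here is a minimal sketch of collecting one from inside a running process. It relies on `v8.writeHeapSnapshot()` from Node's built-in `v8` module; the helper names (`heapKeepsGrowing`, `captureHeapSnapshot`, `watchHeap`), the sampling interval and the growth threshold are all assumptions made up for illustration, not something proposed in this thread.

```javascript
'use strict';
const v8 = require('v8');
const os = require('os');
const path = require('path');

// Hypothetical heuristic: true when every sample is larger than the previous
// one and the total growth across the window is at least `minGrowthBytes`.
function heapKeepsGrowing(samples, minGrowthBytes) {
  if (samples.length < 2) return false;
  for (let i = 1; i < samples.length; i++) {
    if (samples[i] <= samples[i - 1]) return false;
  }
  return samples[samples.length - 1] - samples[0] >= minGrowthBytes;
}

// Writes a .heapsnapshot file into `dir` and returns its path.
function captureHeapSnapshot(dir = os.tmpdir()) {
  const file = path.join(dir, `heap-${process.pid}-${Date.now()}.heapsnapshot`);
  return v8.writeHeapSnapshot(file);
}

// Hypothetical monitor: samples heapUsed on a timer and writes a snapshot
// once the last `windowSize` samples show steady growth. Defaults are
// illustrative only.
function watchHeap({
  intervalMs = 60000,
  windowSize = 5,
  minGrowthBytes = 50 * 1024 * 1024,
} = {}) {
  const samples = [];
  const timer = setInterval(() => {
    samples.push(process.memoryUsage().heapUsed);
    if (samples.length > windowSize) samples.shift();
    if (samples.length === windowSize && heapKeepsGrowing(samples, minGrowthBytes)) {
      console.error('heap snapshot written to', captureHeapSnapshot());
      samples.length = 0; // start a fresh window after each snapshot
    }
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return timer;
}
```

Writing a snapshot is synchronous and blocks the event loop while it runs, so a real guide would probably want to discuss when it is acceptable to trigger one in production. The resulting file can be loaded into the Memory tab of Chrome DevTools for analysis.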

Regarding the structure, I see we can have different flows such as:

  • organized based on symptoms [ memory, exception, performance, hang, crash ... ]
  • organized based on tools [ tracer, profiler, dumper, inspector, debugger, reporter ... ]
  • organized based on subsystems [ installer, net, fs, child_process, natives, uv, ... ]

I suggest the symptom based categorization as it leads to faster discovery for consumers.

Regarding the best practice content, again I see a few models:

  • dump of complete diagnostic steps on a failing case, with example code
  • screen-shot assisted illustration of diagnostic methodology
  • plain text elaboration of tools and their usage

I suggest the first model as it leads to improved education for consumers.

@joyeecheung, @gireeshpunathil thanks. Let me strawman an outline and see if this matches what's in your head - feel free to tear it down. :)

  • Production Configuration Best Practices
    • application logging & log management
      • breakouts for different app types
        • server apps
        • serverless
        • desktop
        • IOT
    • APMs
    • Node Internals Tracing
      • configuration
      • interpretation
  • Troubleshooting
    • Profiling Perf Problems
    • Analyzing Memory Leaks
    • Analyzing Core Dumps
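As one illustration of what a "Profiling Perf Problems" entry could contain, here is a rough sketch that captures a CPU profile from inside the process using Node's built-in `inspector` module. The helper names `post` and `profileCpu` are hypothetical; only `inspector.Session` and the `Profiler.*` protocol methods come from Node itself.

```javascript
'use strict';
const inspector = require('inspector');

// Promise wrapper around the callback-style session.post().
function post(session, method, params = {}) {
  return new Promise((resolve, reject) => {
    session.post(method, params, (err, result) => (err ? reject(err) : resolve(result)));
  });
}

// Hypothetical helper: runs `workload` (sync or async) under the V8 CPU
// profiler and resolves with the raw profile object.
async function profileCpu(workload) {
  const session = new inspector.Session();
  session.connect();
  try {
    await post(session, 'Profiler.enable');
    await post(session, 'Profiler.start');
    await workload();
    const { profile } = await post(session, 'Profiler.stop');
    return profile;
  } finally {
    session.disconnect();
  }
}
```

Serializing the returned object with `JSON.stringify` into a `.cpuprofile` file should make it loadable in Chrome DevTools. Whether in-process profiling like this or an external profiler is the better fit is exactly the kind of question the guide would need to address.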

thanks @mike-kaufman. While the troubleshooting part is straightforward for me to relate to, the first part (production configuration best practices) looks much wider in scope to me:

  • is it possible to suggest configurations for such wide deployment scenarios?
  • even with say server apps, is it possible to generalize configurations?
  • is it possible to propose configurations independent of execution environment of such app types?

I gave a talk at NodeSummit about this topic, where I showed 6 tools suited for production environments. Even though the topic is subjective, I don’t think there’s much disagreement on which tools or techniques should be used (our current pool of production tools is not that large).

The tools I showed alongside the examples are available here if anyone is interested. I would like to help to write these guides :)

I had a discussion with @mike-kaufman and the consensus was to start with a draft and iterate over PRs and refine through collective intelligence.

So let us start with, say, profiling perf problems, and then the others can follow that structure. I will start looking at dump debugging.

@mmarchini - that's a great start :)

Thanks Gireesh. I'd like to ultimately get the content written to leverage GitHub's auto-html-site feature - i.e., we'd be able to submit markdown updates to this repo, and they would be automatically rendered to html available at http://nodejs.github.io/diagnostics/bestPractices/.

@mike-kaufman auto-generating a website is an awesome idea! But maybe we should try to coordinate with @nodejs/website to have this content available in https://nodejs.org/ as well?

@nodejs/website to have this content available in https://nodejs.org/ as well?

Yes, this would be good. I think as the first step though, we can start getting the content organized, and the github.io "auto-magic-web-site" is a really simple & cheap way for us to get that content rendered & reviewable, w/out the distraction of how we plug into their process.

@bnb - is there any thinking on how we can plug content into the new website? Ideally, we'd have a bunch of markdown & images here (in diag repo), and this would just get "sucked up" into the website.

Please get in touch with @nodejs/website-redesign, as we are in the middle of planning the content structure for the website relaunch:

https://github.com/nodejs/website-redesign/issues/

For what it's worth, a long time ago I had written a set of guides to investigate various types of production issues with Node.js. @cjihrig kindly made those guides available publicly at https://github.com/joyent/node-debugging-methodologies. The content is mostly specific to SmartOS but the methodologies/concepts can almost always easily be transferred to other OSes.

@misterdjules - thanks, this is great!

There's a biweekly meeting for the website-redesign initiative, and the next one is on Thursday Aug 16th, 15:00 UTC.

Anyone here is welcome to attend. You're welcome to propose adding items to the agenda at the start of the meeting 👍🏽 If you wanted to know where this Diagnostics best practices guide could fit in, that'd be a good place to get direct comments.

Otherwise, please make an issue to get the discussion going!

The initiative is also working on creating guides for many things in Node.js. I don't have context for this diagnostics discussion, but it's possible a diagnostics guide could be part of the "analytics" or "ops" categories. Feel free to correct me here or on that issue :)

We should incorporate https://github.com/naugtur/node-diagnostics-howtos as well. @naugtur has been working to get some of these on the website. See nodejs/nodejs.org#1444

I'm interested in contributing to this guide. It's likely the flame graph one will finally come through and I'll choose something simpler for the next one :)

I like the idea to organize by symptoms.
Also, basic best practices should say what to switch on in production to even get the chance to diagnose anything (like profiling command-line switches or enabling core dumps at the OS level).
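A rough sketch of how an application could check at startup that such switches are actually on. The flag list is only an assumed example (`--report-on-fatalerror` and `--abort-on-uncaught-exception` are real Node.js options, but which ones a guide should recommend is still open), and `missingDiagnosticFlags` is a hypothetical helper.

```javascript
'use strict';

// Assumption: this list is illustrative, not a recommendation from this thread.
// --abort-on-uncaught-exception only yields a core file if core dumps are
// also enabled at the OS level (for example via `ulimit -c`).
const WANTED_FLAGS = ['--report-on-fatalerror', '--abort-on-uncaught-exception'];

// Hypothetical helper: returns the wanted flags that do not appear in `args`,
// accepting both the bare form and the `--flag=value` form.
function missingDiagnosticFlags(args, wanted = WANTED_FLAGS) {
  return wanted.filter(
    (flag) => !args.some((arg) => arg === flag || arg.startsWith(`${flag}=`))
  );
}

// Flags can arrive on the command line or through NODE_OPTIONS, so look at both.
const activeArgs = [
  ...process.execArgv,
  ...(process.env.NODE_OPTIONS || '').split(/\s+/).filter(Boolean),
];

const missing = missingDiagnosticFlags(activeArgs);
if (missing.length > 0) {
  console.warn('diagnostics flags not set:', missing.join(', '));
}
```

This only warns rather than failing, on the assumption that a missing diagnostics switch should not stop a production process from starting.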

I think we should aim this for content as part of the new website redesign.

organized based on symptoms [ memory, exception, performance, hang, crash ... ]

IMO this is probably the most valuable for most users.

bnb commented

100% agreed.

@amiller-gh this is probably one you'll want to see 😄

As discussed and decided in the just-concluded workgroup meeting (and probably in the last couple of meetings), I plan to set up one meeting (or more, until convergence) to define the next steps on this.

The goal is to gather consensus on the content and to identify some prioritization.

Does 21st Nov 9.30 AM PST work for everyone? please 👍 if you will be able to make it and the time suits you, thanks!

thanks @mhdawson and @mmarchini for expressing availability. However, given the lack of a reasonable number of responses, I think we cannot hold this meeting tomorrow and should move it to a later point in time.

Instead of me prescribing an alternative time, @nodejs/diagnostics - will you please express your availability within the next 10 days or so? based on that I can schedule one. thanks in advance!

or should I spin up a doodle?

@gireeshpunathil I think we should spin up a doodle, with a deadline to complete by end of this week and then choose meeting for next week.

I suggest opening a new issue for the meeting to make it more visible, and to put the doodle there.

If we still have a lower number of responses then I think we just need to move forward with whoever responds.

Let me know if I should make a point to come to this 🙂 Otherwise, just sending a friendly reminder that I'd love to see at least one deliverable be a PR adding content to our future website docs, like what Flavio has done with getting started content here: https://github.com/nodejs/website-redesign/pull/105/files

Excited to see this happen!

here we work on our 'performance & diagnostics' best practices section:
goldbergyoni/nodebestpractices#256

I'll be glad to join our forces

a separate issue is spawned to track upcoming meetings to discuss this. #254

removed from wg meeting agenda, as discussed in the last meeting (rationale: the work is progressing as part of the user journey deep dives and subsequent documentation work. If the doc work stalls, we could always re-insert this to regain focus)

As a reminder, tomorrow we're meeting (same time as always) to discuss diagnostics on CPU usage.

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
