nodejs/node-v0.x-archive

Remove process.on("uncaughtException", ...) handler

felixge opened this issue ยท 45 comments

I propose that we get rid of this feature. So far I have seen nothing but bad things coming out of this, as it is entirely unsafe to continue the execution of a node program if the stack has been ripped at a random place.

People are also under the illusion that it is possible to trace back such an exception to the http request that caused it, because the express error handler does some magic that pretends it can handle this.

So as soon as Domains land, I think we should no longer allow a global exception handler.

Also: Mea culpa, I pushed for this being included initially.

+1 for that for the most of the cases.

But uncaughtException may be used to handle error and report it before process exit.

Maybe do not remove the handler but let the process exit anyway after event is emitted?
So one can listen for such an event report it and process exits afterwards

I don't know. uncaughtException isn't inherently evil and sometimes useful. For example, the cluster module uses it to kill off the workers when an error happens in the master.

But uncaughtException may be used to handle error and report it before process exit.

Not safely. By reporting you probably mean doing some I/O. What if the callback never fires? Note: setTimeout may no longer work if the exception occurred inside the callback of a timer. (The later could be fixed by making setTimeout callbacks "exception safe" if they aren't already, but it's still questionable to perform I/O after an uncaught exception).

I don't know. uncaughtException isn't inherently evil and sometimes useful. For example, the cluster module uses it to kill off the workers when an error happens in the master.

Useful yes, but there is no guarantee that the I/O for the notification will still work (see above). That's unless we make the entire node core exception-safe. And by that I mean that we must guarantee that any callback triggered by node will not be able to negatively impact global or module state by throwing an exception. (This may already be the case, but I doubt it / it would require an audit to verify).

My main point is that the current "uncaughtException" facility is a very leaky abstraction and I've seen many cases where it causes great harm / risk to companies using it (by accident via a module, or on purpose). The typical story I hear is: "We had this node process that just stopped doing stuff". In most cases this happened because they had an exception in a callback from their database client (which was not exception-safe), but their uncaughtException handler left the program running, but the db client was now in a bad/stale state where it would no longer answer queries.

--fg

Thinking about it more: I agree that "uncaughtException" handler can be useful / the problems I'm concerned about are rare edge cases. But I would really like to avoid having regular users deal with them as those edge cases are extremely hard to debug without a very good understanding of what is going on.

Agreed. However, removing uncaughtException or unconditionally exiting afterwards is a pretty big change. I'm not against it per se but it would help to know what modules and applications it will break (and why). Maybe bring it up on the mailing list?

@bnoordhuis when isolates is more stable cluster should use threads. So this is not a big deal to the cluster module.

But how should I debug rare daemon bugs?

Placing every single daemon instance into screen is not the option. Having an error and dead process is not the option either. Think about production.

Placing every single daemon instance into screen is not the option. Having an error and dead process is not the option either. Think about production

I'm not sure what you're talking about. How are you currently using 'uncaughtException'?

--fg

-1

I do not agree with this issue. In a socket and event based system, just because something failed to parse some json or some other random error bubbled up does not mean that my app needs to crash. Even if there are other processes handling connections, a crash will sever all current connections for that process and potentially cause problems for those connections. The whole point of uncaughtException for me is that it lets me handle many errors which are really not critical to my application and show not make it crash. By catching it there, I log it and allow the application to continue running just fine (and it does run just fine). In my opinion, exiting the event loop is the last thing I want to do.

tj commented

? express doesn't do anything with uncaughtExceptions

@felixge I have pool of node.js processes sharing the same server-side code on the one physical server. Processes running as daemons (node-init modules). If somehow I make a mistake that cause exception without try/catch (I may forget, I suppose, you may do that too) node just crash. Nothing else happens. Just crash without logging (daemons, you know).

With uncaughtException I can log this case and fix the problem as soon as I can. No downtime (production!), no shitty users, no shitty "nodejs is just not ready for production", just error in appropriate logs. Awesome.

If modules use uncaughtException โ€“ this is wrong. If main application use this - this is ok, at least for me.

With "running in screen" I meant starting screen process and running every single node.js process (what if I have 10, 100, 1000 on different servers?) in it to review exception stack trace later if process crash.

@shtylman

The whole point of uncaughtException for me is that it lets me handle many errors which are really not critical to my application and show not make it crash.

You have no way of knowing which exceptions are critical to your application. By the time they hit 'uncaughtException' the necessary context for determining that is not available to you.

By catching it there, I log it and allow the application to continue running just fine (and it does run just fine). In my opinion, exiting the event loop is the last thing I want to do.

@visionmedia

? express doesn't do anything with uncaughtExceptions

Hm, doesn't the error handler middleware try to handle exceptions? Or did I make this up? I'm pretty sure I have seen express show exception traces to me in the http response before, but now that you mention it, I don't see uncaughtException in the code. Are you try...catch-ing the route callbacks?

@bobrik

Nothing else happens. Just crash without logging (daemons, you know).

Uhm, why don't you make sure your daemons stdout / stderr are redirected to some log file?

--fg

@felixge my point is not that you should know the context, it is that such an exception (just because it bubbled up) is not necessarily application ending nor always so critical. I am fully capable to getting the backtrace in the exception handler function and dealing with it appropriately. If the system does not need to exit it should not exit.

tj commented

@felixge you can do next(err) to delegate error handling but that's only the ones passed to it no uncaught exceptions

@felixge they are, but I don't want to crash. I already have watchdogs for my instances, but I still don't want to crash in production.

@shtylman @bobrik did you read the original ticket description? "So as soon as Domains land, I think we should no longer allow a global exception handler.". I agree that crashing an entire node process is not always desirable, but that is not being discussed here. This discussion is about the finer points of how 'uncaughtException' currently fails to address this problem in a safe way.

-1: Domains are not defined and the scope is not known yet. Assuming it will solve all the use cases that uncaughtException does is premature to say the least.

With uncaughtException it is possible to implement domains in serialized callbacks, something that would be impossible to do without it.

I think it would be a mistake to remove this feature, at least while domains is not stable and we have a valid replacement for the use cases it currently solves. Miss-usage shouldn't be reason to remove it.

I tend to agree with keeping it. If you don't like it, don't use it. If you think it is being used improperly, advise against that. However, I think it's destructive to force opinions upon those that disagree, or might find creative ways to handle the potential problems that may or may not arise from its use.

I have been using this feature in almost all of my programs and so far I have not found a case where the continuation of the program proved to be unsafe, in other words corrupting data, not being recoverable, or causing other problems.

@felixge can you please explain what you mean by "it is entirely unsafe to continue the execution of a node program if the stack has been ripped at a random place", give an example of where this would likely happen in a Node program, and also what bad things you have seen that have come of it? I know that there are quite a few people who prefer to let their program crash and then automatically be restarted by some monitoring process. However, in my experience, as I said, I have not seen any advantage to doing this. So can you please elaborate as far as the unsafeness and problems you have seen.

If I really can't or shouldn't keep doing it, then I won't. But at this point with the information and experience I have, it is not logical for me to let my programs crash and have to be restarted.

EDIT: Actually I see a little more information that I didn't see before "The typical story I hear is: "We had this node process that just stopped doing stuff". In most cases this happened because they had an exception in a callback from their database client (which was not exception-safe), but their uncaughtException handler left the program running, but the db client was now in a bad/stale state where it would no longer answer queries."

Is this a node-mysql client that was involved in this unrecoverable state, and if so, is that type of situation where an uncaught exception leaves the module or client object from a module in an unrecoverable state typical and something that is difficult or impossible to avoid? Also, are you sure that it is necessary for the program to crash in that case, or would it maybe be acceptable for a new client to be created, or for that error to first be logged to a special file before the program exited?

If I remember correctly, in one of my applications I am using node-mysql, so thank you for all of your great work there. I am also using a pooling module.

I use uncaughtException for pretty-printing errors and maybe do some "graceful shutdown" sorts of actions. Like many things, it shouldn't be abused (too hard) but I think it's a little silly to remove this functionality just because sometimes people are stupid.

Edit: I always call process.exit(1) at the end.

We're not going to remove uncaughtException. Domains rely on it. Also, it's useful. If you don't like it, don't use it. Domains can be used to handle errors in a way that provides some context, but even when a domain has an error, you really ought to exit the process asap. That's the kind of thing that's really hard to enforce with an API. You just have to trust people to not want to run their apps in an undefined state.

I'm of two minds. I agree with @isaacs that it's useful. On the other hand, uncaughtException is like a double-edged sword - but at least a sword has a side without sharp edges.

If uncaughtException is going to be kept as it is, can you please update the doc and remove/update this sentence ( http://nodejs.org/api/process.html#process_event_uncaughtexception ):

Note that uncaughtException is a very crude mechanism for exception handling and may be removed in the future.

@isaacs Or somebody who has access should definitely update the docs as @thefosk mentioned above. I was mislead believing it might be removed for a long time, until I just read this thread. It would be good to get that corrected to avoid misleading others.

@thefosk @jwarkentin I just updated the docs in a2fd657 . @isaacs hope that's ok.

@felixge LGTM, thanks. Just to clarify that comment above, though, domains don't rely on the uncaughtException event any longer, as of 0.10

@felixge @isaacs Just a thought, it might be worth explaining the comments in those docs about the brittle state after exceptions are thrown as well as the comments on the documentation for domains. I was trying to understand why a restart would be recommended after catching exceptions like that so I posted a question on Stack Overflow. I hope the explanations given there were correct, but I don't think it's very obvious.

Better explanations would help developers make better decisions about what to do and if they are not going to shut down, at least how to try to write their code in a way that doesn't leak resources so much if exceptions are thrown.

Side note: We need Python style context managers in JS/Node.js for handling stuff like this better

I was wondering if someone could clearly explain to me, under what circumstances the user of 'uncaughtException' are unacceptable. I am trying to understand the technical reasons to not use it. All I have heard is 'it negatively impacts the global/module state' or 'program if the stack has been ripped at a random place'.

I have a reasonably large application, I want to catch things like parsing errors in places that have slipped through. I do all my IO only in initialization, so I can deal with this fine. My only other io is to redis, but redis already implements its own error handling and will not throw exceptions.

Is it safe for me to use this method (and continue my app running), provided I handle the initilization IO correctly? Or is this really not a good idea under any circumstance, and really I should just let my app die, disconnect all users (its a realtime game, so it cant really gracefully reconnect users easily, so this is a rather nasty solution)?

@Jezternz I actually was trying to understand the same thing and it makes a lot of sense now. I wrote a blog article on it. Check it out. In short, though, catching exceptions thrown in an unknown place in the program can leave resources around that should have been cleaned up, and leave parts of the program in an undefined state that may interfere or cause unwanted behavior in the application if it continues to run.

Thanks @jwarkentin, that has really helped me, I agree with all your concerns. It concerns me that languages like python and .net, simply trump nodejs in this area. with their 'using' or 'with' blocks, surely we can think up a creative way to address this in an async environment.

I think the resources can be managed better for sure. There's still the problem of the generally unknown state of the application if an exception is caught in a global handler where there is no context. It's always best to handle exceptions where they occur as much as possible, but in critical applications that can't afford to shutdown on a whim sometimes the tradeoff might be acceptable. You are always taking a risk when choosing to continue. At the very least exceptions should be logged and someone should be notified to fix the problem.

Personally, I've gone the route of removing my global exception handler and using forever to restart my server when it crashes. The problem of shutting down with so many websockets connected is somewhat mitigated by the fact that I run multiple instances of the server on different servers and load balance between them. It still is bad when one of the drops all the connections, so I try to fix all exceptions as quickly as possible. It's become quite robust over time and errors are handled much more cleanly.

@Jezternz Any language with shared mutable state, and long-jumping throws, will encounter this problem.

When you land in process.on('uncaughtException'), the jump is exceptionally long. However, even going up a few stack frames can be hazardous.

function doTheThings() {
  for (var i = 0; i < 1000; i ++) {
    doTheThing(i);
    cleanupThing(i);
  }
}

function doTheThing(i) {
  if (false === openKajigger(i)) {
    throw new Error('Failed to open kajigger');
  }
}

function cleanupThing(i) {
  closeKajigger(i);
}

try {
  doTheThings();
} catch (e) {
  // how many were open?  No one can know for sure,
  // but it's at least one more than was closed!
  // This is what we mean by an "undefined state".
  // even without asynchrony or uncaughtException,
  // it's still simply impossible to recover correctly,
  // in the general sense, without wrapping EVERY
  // potentially throwing line in a try, and knowing how
  // to recover from any failure, and that's usually absurd.
}

I'd suggest following the links in these comments to the other issues, where this is discussed at length.

Or, tl;dr - Use domains to handle errors by shutting down your program gracefully, with minimal user impact. Use cluster to isolate the scope of error states and do zero-down-time restarts.

If you understand why this rule is important, and you know for certainty that your case is different, and you've checked with someone else who also understands and they also agree it's a special case, then of course disregard this advice. B if you don't know why this is true in the vast majority of cases, then just follow the rule, and you'll be safe.

Note that with promises, you can easily build your own using-like context manager (safe from throws)

Asumes Q

function using(resource, chain, disposer) {
  var resourceP = Q(resource);
  return resourceP.then(chain).finally(function() { 
    return resourceP.then(disposer); 
  });
}

Here is how you'd use it:

function connectAndFetchSomething(...) {
  return using(net.connect(host), function(conn) {
    var stuff = JSON.parse(something);
    return doSomethingWith(conn, stuff)
      .then(function(result) { 
        return doOherThingWith(result, JSON.parse(other)); 
      )).then(...) // etc
      .then(function(finalResult) { 
        return finalResult; // if needed
      });
  }, function(conn) { 
    conn.end();
  });
});

The connection will be released after the promise chain within using completes, or if any error was thrown within that function (like JSON.parse) or its inner .then closures (like the second JSON.parse), or if a promise in the chain was rejected (equivalent to callbacks calling with error). Gets them all without crashing.

Would be even easier if a general resource disposal interface is agreed upon:

function using(resource, chain) {
  var resourceP = Q(resource); // wraps it in case the resource was not promise
  return resources.then(chain).finally(function() { 
    return resourceP.then(function(resource) { return resource.dispose(); }); 
  });
}

Then there would be no need to add the disposal callback:

function connectAndFetchSomething(...) {
  return using(net.connect(host), function(conn) {
    var stuff = JSON.parse(something);
    return doSomethingWith(conn, stuff)
      .then(function(result) { 
        return doOherThingWith(result, JSON.parse(other)); 
      ))
      .then(...) // etc
      .then(function(finalResult) { 
        return finalResult; // if needed
      });
  });
});

Putting it out there just in case core starts thinking about promises again (for node 2.0 or similar :) Sorry if its not the appropriate place to do it.

What's the progress on this? Will or will not be uncaughtException removed?

Not sure what happened: @felixge removed the part saying that the event may be removed in the future, but I still see it on the documentation pages:

http://nodejs.org/docs/latest/api/process.html#process_event_uncaughtexception

@IonicaBizau It's probably never going to get removed but that doesn't mean you should use it, it's broken by design.

Then docs should be updated...

How so? doc/api/process.markdown makes it abundantly clear that you should never-ever-under-any circumstance-on-pain-of-pain use it, doesn't it?

@bnoordhuis I mean, here, on documentation page, it says:

Note that uncaughtException is a very crude mechanism for exception handling and may be removed in the future.

However, and may be removed in the future was removed in this commit.

@IonicaBizau because it won't be removed.

Now I'm really confused: the markdown file contains the real content (it doesn't say it will be removed), but the documentation page, says it may be removed in the future:

screenshot from 2014-10-25 20 43 06

What's the process of updating/generating the documentation pages?

Looks like the documentation pages were updated. ๐Ÿ‘

I think that it should be configurable by process.fireuncought=false or by comd line, since both ways will leave people in disaster