cwzx/nngpp

Async demo has undefined behavior

Opened this issue · 6 comments

In the asynchronous REQREP demo, an exception is thrown in the callback [1]. However, the callback takes a direct function pointer [2] and nng may or may not have been built with -fexception (or equivalent) [3]. There is a function-try-block which looks like it will catch errors in the callback but the thread running the callback originates inside of the nng threadpool - which is in C.

One way to solve this could be to not directly provide nng with the user-provided callback. Instead, have the aio constructor take an std::function and provide that to an argument to a "trampoline" function which catches any and all exceptions from the user-provided function. This trampoline is what you would then provide to nng as the AIO callback.

cwzx commented

You are correct. We shouldn't be emitting exceptions from nng callbacks.
Your suggestion is a safe way to ensure this never happens.

Found the same problem. May I ask, has there been update on this?
@neachdainn what was your solution?
Do you know how nng reacts on errors in this situation? Lets say there's an error sending (in my case that would happen asynchronously) should we try managing the failed message ourselves or does nng do some kind of management? The original nng demo exits the process on any error, doesn't seem like it would manage anything...

This really isn't an nng thing - it comes down to the behavior of the compiler. The nng library operates within the C specification - meaning it doesn't address C++ exceptions at all. I think most modern compilers do something sane (I think), but the real solution is to catch exceptions at the FFI boundary.

In my own code, I catch exceptions at the FFI boundary (i.e., the callback). In my older code, I would occasionally check a flag in my own threads to see if an exception occurred. In my newer code, I just abort the program if I catch an exception - not ideal but better than undefined behavior. If someone comes up with a better solution, I would love to hear it.

EDIT:

May I ask, has there been update on this?

Based on the latest tag in this repo, I do not think so. The changes I described in the original post aren't too crazy, so perhaps I should just get around to making them myself.

One thing that might mostly work is to have aio_view::wait check a global / aio / context / socket specific variable and re-raise the exception if present. That should work if you can require that each asynchronous operation eventually is waited on, though that probably ends up being a project-specific policy rather than something a library can enforce.

@neachdainn Thanks much for coming back. I've added a thread-safe context to our worker class which allows me to inform the server coordinator in case of an error inside the callback. The problem is only... what then? As you said, we can't allow undefined behavior. But are there cases where we can save the running process? And when will we get errors in the first place? Does nng e.g. retry sending failed messages before it emits the error? Anyway, that's questions for Garrett/nng

Apologies if these responses are too delayed to be of any use.

I believe that if NNG reports an error, it will not attempt to do anything else with the message. That's the logical expectation given that the user owns the message upon error. That being said, some protocols are more "reluctant" to return an error than others; some have built-in retry (e.g., REP) and others will just drop messages (e.g., PUB).

If you're the one consuming the code, then the correct answer is probably pretty specific to your application. If you're writing a library, I don't know if there's a perfect answer. In my wrapper, I document that the user needs to either come up with their own scheme for handling errors or just not emit errors (though I could see plenty of argument for silently eating the errors instead of killing the program).