nerves-project/shoehorn

Shoehorn crashing node on errors in Application.start()?

dognotdog opened this issue · 4 comments

Environment

Nerves 1.9.1 on Raspberry Pi Zero W

Current behavior

The app exits and takes the Erlang node down with it. Errors occurring after Application.start() seem to be handled as expected and are contained without affecting the rest of the node.

00:00:07.857 [info]  Application minispeck exited: XXX.Application.start(:normal, []) returned an error: shutdown: failed to start child: XXX.SensorSupervisor
    ** (EXIT) shutdown: failed to start child: XXX.SensorReader
        ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
 
[nbtty: terminating]
[   27.959711] heart: Erlang has closed.
[   27.984101] erlinit: Erlang VM exited
[   27.991786] erlinit: Sending SIGTERM to all processes
[   27.997751] watchdog: watchdog0: watchdog did not stop!
[   29.010179] erlinit: Sending SIGKILL to all processes
[   29.383207] reboot: Restarting system

Expected behavior

The application_exited() callback should be called, and the node should in general behave as if the application had crashed anywhere else.

This looks more like something in XXX.SensorReader is crashing the VM hard. Have you tried removing that child to see whether things work as expected? Crashing the VM isn't really recoverable, and I wouldn't expect shoehorn to be able to handle that.

That was just typical output. Others were proper traces with argument errors and the like, from while I was debugging that module. Those seem like they should not crash the whole VM, since similar errors in code called after start(), e.g. in the SensorReader process, did not cause this behavior. I'll see whether this happens again if I simply move the error out of the start() phase with a send_after(), as that should give definite proof one way or the other.
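A minimal sketch of that kind of experiment, assuming a hypothetical GenServer named DelayedCrash (not part of the original application, and the :delay_ms option is made up for illustration). init/1 returns successfully, so the supervisor and Application.start() complete, and the failure is raised only later, when the timer message arrives:

```elixir
defmodule DelayedCrash do
  # Hypothetical stand-in for a worker such as XXX.SensorReader.
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts)
  end

  @impl true
  def init(opts) do
    # Assumed option name; defaults to a one-second delay.
    delay = Keyword.get(opts, :delay_ms, 1_000)

    # Schedule the failure instead of failing here, so startup succeeds.
    Process.send_after(self(), :fail, delay)
    {:ok, %{}}
  end

  @impl true
  def handle_info(:fail, _state) do
    # Raised well after start(), inside the already-running process.
    raise ArgumentError, "simulated failure after start"
  end
end
```

If the node still goes down with a worker like this in the supervision tree, the start() phase itself is probably not the deciding factor.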

Alright, some further digging shows that this can still happen even after Application.start(). Maybe it's a weird effect of a :one_for_all supervisor tearing down something that communicates over I2C and thereby crashing the node, as the indicated process crash looks benign. I'll keep an eye on this, but it might be something unrelated.

Closing this old issue since I'm unsure whether it still applies.