lampmerchant/tashrouter

TashRouter's LToUDP port crashing if started as a systemd service.

NJRoadfan opened this issue · 3 comments

Running into a weird problem. If I set up TashRouter as a daemon that launches on system boot, the LToUDP port driver crashes. Below is the debug output from Python:
Exception in thread Thread-10:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/sbin/tashrouter/port/localtalk/__init__.py", line 200, in _node_run
    if send_enq: self.send_frame(bytes((send_enq, send_enq, self.LLAP_ENQ)))
  File "/usr/local/sbin/tashrouter/port/localtalk/ltoudp.py", line 74, in send_frame
    self._socket.sendto(self._sender_id + frame_data, (self.LTOUDP_GROUP, self.LTOUDP_PORT))
TypeError: unsupported operand type(s) for +: 'NoneType' and 'bytes'

Despite the systemd service specifying After=network-online.target and Wants=network-online.target, it appears that the system still starts TashRouter before the network interfaces are fully up. If I change the systemd service to depend on a later service (like atalkd), or intentionally add a 10 second startup delay (either via systemd or inside the start(self, router) method), it starts fine, as all the network interfaces are initialized and "up" by that point.

Notably, the system is not throwing "OSError: [Errno 19] No such device" in this case. Once TashRouter is in this state, I have to kill the process with kill -s 11 since systemd fails to stop the process on its own, so it's a pretty bad crash.

Hm, looks like a race condition. LocalTalkPort is starting its thread and getting as far as sending a frame before LtoudpPort has set self._sender_id. Don't know why this case is stimulating it, but looking at it now, it's one of those "how did this ever work" situations... I'll have a fix in shortly. Thanks for finding this.
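For anyone following along, here is a minimal sketch of the ordering that avoids this kind of race. It is not the actual TashRouter code or the actual fix: BasePort, SafePort, the frame bytes, and the sender ID value are all hypothetical stand-ins. The idea is simply that the subclass sets everything send_frame() depends on before the base class is allowed to start its thread.

```python
import threading


class BasePort:
    """Hypothetical stand-in for a port whose start() launches a worker thread."""

    def __init__(self):
        self.sent = []
        self._thread = None

    def start(self):
        # The worker may call send_frame() as soon as the thread is running.
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        # Placeholder frame bytes, for illustration only.
        self.send_frame(b'\x01\x01\x81')

    def join(self):
        self._thread.join()


class SafePort(BasePort):
    """Hypothetical subclass that finishes its own setup before the thread starts."""

    def __init__(self):
        super().__init__()
        self._sender_id = None

    def start(self):
        # Set everything send_frame() depends on first...
        self._sender_id = b'\x00\x00\x00\x01'  # placeholder sender ID
        # ...and only then let the base class start the worker thread.
        super().start()

    def send_frame(self, frame_data):
        # Record the frame instead of sending it on a socket.
        self.sent.append(self._sender_id + frame_data)


port = SafePort()
port.start()
port.join()
```

If super().start() were called before self._sender_id was assigned, the worker could reach send_frame() while the attribute was still None, which is the TypeError in the traceback above.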

a416126 should fix this.

Fix appears to be working in testing. I suspect the trigger was that the network stack wasn't fully initialized on boot (or something else was slowing things down), allowing the LocalTalkPort thread to outrun the rest of the setup. When started from a command prompt after the system is up, this likely was never an issue.
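A rough way to see how slow setup could expose the problem, as a sketch with hypothetical names (RacyPort and start_with_slow_setup are not from TashRouter): the thread is started first, and the setup that assigns the sender ID is forced to finish only after the worker has already tried to send. The join() before the assignment is a stand-in for network setup that takes long enough for the worker to get there first.

```python
import threading


class RacyPort:
    """Hypothetical port that starts its worker before its own setup is done."""

    def __init__(self):
        self._sender_id = None
        self.error = None

    def send_frame(self, frame_data):
        # Fails if _sender_id has not been assigned yet.
        return self._sender_id + frame_data

    def _run(self):
        try:
            self.send_frame(b'\x01\x01\x81')  # placeholder frame bytes
        except TypeError as exc:
            # Keep the exception so the caller can inspect it.
            self.error = exc

    def start_with_slow_setup(self):
        thread = threading.Thread(target=self._run)
        thread.start()
        # Stand-in for slow setup: the worker finishes before we get any further.
        thread.join()
        # Too late: the worker has already tried to use _sender_id.
        self._sender_id = b'\x00\x00\x00\x01'  # placeholder sender ID


port = RacyPort()
port.start_with_slow_setup()
print(type(port.error).__name__, port.error)
```

On a fast, fully initialized system the assignment would normally win the race, which would fit with this never showing up when starting from a command prompt.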