paulhoule/tentacruel

Better handling of transient errors in cron jobs

Closed this issue · 3 comments

I ran my "wake up" job this morning and it failed to run. I got the following error message:

Fatal error: protocol.data_received() call failed.
protocol: <tentacruel.HeosClientProtocol object at 0x7f6e80660048>
transport: <_SelectorSocketTransport fd=6 read=polling write=<idle, bufsize=0>>
Traceback (most recent call last):
File "/usr/lib/python3.7/asyncio/selector_events.py", line 813, in _read_ready__data_received
self._protocol.data_received(data)
File "/home/paul/src/cron/tentacruel/tentacruel/__init__.py", line 93, in data_received
self._handle_response(jdata)
File "/home/paul/src/cron/tentacruel/tentacruel/__init__.py", line 112, in _handle_response
message = {key: value[0] for key, value in parse_qs(jdata["heos"]["message"]).items()}
KeyError: 'message'
Task was destroyed but it is pending!
task: <Task pending coro=<HeosClientProtocol._setup() running at /home/paul/src/cron/tentacruel/tentacruel/__init__.py:183> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f6e74139d38>()]>>

Note the root cause in that error message is not that data_received() failed (it worked just fine) but that
the 'message' key was not found.

One thing I want to know is which exact call failed.

It could be that a particular result really should have come back empty. We should never get a crash like that if this is the case.

It also could be that the system was slow to wake up and behaved abnormally for a short time.

To clarify the remarks above.

The reception of data worked just fine. Then the asyncio system called a data_received() function I wrote
and that failed.

That particular selector comes from selector_events.py and if I understand it right, something might
happen differently if I used a different eventloop implementation.

Once the data_received call throws an exception we lose control of what is happening. That is real
bad. This may mess up our connection, but it also may mess up our own data structures since we have a list of futures associated with callbacks and that future will still be in the list of callbacks if we do nothing about it.

Note additionally that we are getting back no "message" in the branch where we are handling a server-generated event.