PandABlocks/PandABlocks-client

Handle PandA disconnect exceptions more elegantly


During a late-night run of I22 on @gilesknap's container, the following error was repeated many times:

ERROR:PandA did not respond to GetChanges within 1.0 seconds. Setting all records to major alarm state.
callbackRequest: ERROR cbLow ring buffer full
callbackRequest: ERROR cbLow ring buffer full
WARNING:socket.send() raised exception.
ERROR:Task exception was never retrieved
future: <Task finished name='Task-68034730' coro=<StreamWriter.drain() done, defined at /usr/lib/python3.10/asyncio/streams.py:348> exception=BrokenPipeError(32, 'Broken pipe')>
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/streams.py", line 359, in drain
    raise exc
  File "/usr/lib/python3.10/asyncio/streams.py", line 359, in drain
    raise exc
  File "/usr/lib/python3.10/asyncio/streams.py", line 359, in drain
    raise exc
  [Previous line repeated 33623 more times]
  File "/venv/lib/python3.10/site-packages/pandablocks/asyncio.py", line 103, in _ctrl_read_forever
    received = await reader.read(4096)
  File "/usr/lib/python3.10/asyncio/streams.py", line 650, in read
    raise self._exception
  File "/usr/lib/python3.10/asyncio/streams.py", line 359, in drain
    raise exc
  File "/usr/lib/python3.10/asyncio/streams.py", line 359, in drain
    raise exc
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 924, in write
    n = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

We should change this section of code:

except Exception:
    logging.exception(f"Error handling '{received.decode()}'")

to something like:

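  except BrokenPipeError:
      logging.exception(f"Error handling '{received.decode()}'")
      # Back off before trying again; the delay is still to be decided
      await asyncio.sleep(<wait more time before trying again>)
  ...
  # Handle other errors that the PandA should be able to recover from
  ...
  except Exception:
      # Anything else is unexpected: re-raise with the original traceback
      raise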
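
A minimal, runnable sketch of the same idea, in case it helps the discussion. It is not the actual pandablocks code: `read_chunk`, `handle`, `RETRY_DELAY_S` and `max_retries` are hypothetical names, and the delay values are placeholders. A `BrokenPipeError` triggers a bounded, exponentially growing wait before the next read, and anything else propagates to the caller.

```python
import asyncio
import logging

RETRY_DELAY_S = 0.01  # hypothetical base delay; the real value is undecided


async def read_forever(read_chunk, handle, max_retries=3):
    """Pass each chunk from read_chunk() to handle() until EOF (b"").

    read_chunk and handle are hypothetical stand-ins for the client's
    stream reader and its response handler.
    """
    retries = 0
    while True:
        try:
            received = await read_chunk()
            if not received:
                return  # EOF: the peer closed the connection cleanly
            handle(received)
            retries = 0  # a successful read resets the back-off
        except BrokenPipeError:
            retries += 1
            if retries > max_retries:
                raise  # give up and let the caller decide what to do
            logging.warning(
                "Connection lost, retry %d of %d", retries, max_retries
            )
            # Exponential back-off: delay doubles on each consecutive failure
            await asyncio.sleep(RETRY_DELAY_S * 2 ** (retries - 1))
        # Any other exception is deliberately not caught here
```

Bounding the retries matters here: the log above suggests the old behaviour kept hitting the same broken socket indefinitely, so the sketch re-raises once `max_retries` consecutive failures have occurred.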

@coretl Thoughts?

Update

We agreed in a meeting that it is probably best to shut down the pandablocks-ioc completely on such a failure and let the Kubernetes liveness.sh handle restarting it.
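
A rough sketch of what the fail-fast approach could look like, assuming the restart mechanism reacts to the process exiting with a non-zero status. `supervise`, `main`, `ioc_coro_factory` and `EXIT_CONNECTION_LOST` are hypothetical names, not part of pandablocks or pandablocks-ioc.

```python
import asyncio
import logging
import sys

EXIT_CONNECTION_LOST = 1  # hypothetical exit code for a lost connection


async def supervise(ioc_coro_factory) -> int:
    """Run the (hypothetical) IOC coroutine and map its outcome to an exit code."""
    try:
        await ioc_coro_factory()
    except ConnectionError:
        # BrokenPipeError is a subclass of ConnectionError, so it lands here
        logging.exception("Lost connection to PandA, shutting down")
        return EXIT_CONNECTION_LOST
    return 0


def main(ioc_coro_factory) -> None:
    # Exit the whole process so that an external supervisor can restart it
    sys.exit(asyncio.run(supervise(ioc_coro_factory)))
```

Other exception types are left uncaught on purpose, so an unexpected error still ends the process with a traceback rather than being silently swallowed.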