kquick/Thespian

ValueError: file descriptor cannot be a negative integer (-1)

Closed this issue · 2 comments

Environment

  • Python version: 3.8.0
  • OS (uname -a): Darwin io 18.7.0 Darwin Kernel Version 18.7.0
  • Thespian version: 3.10.0
  • Transport: TCPTransport (on a single machine)

Error

When running an actor system on a single machine we get the following stack trace after a while:

2020-04-17 10:51:26.012604 p24308 ERR  Actor esrally.driver.driver.Worker @ ActorAddr-(T|:53459) transport run exception: Traceback (most recent call last):
  File "/Users/daniel/Projects/rally/.venv/lib/python3.8/site-packages/thespian/system/actorManager.py", line 87, in run
    r = self.transport.run(self.handleMessages)
  File "/Users/daniel/Projects/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/wakeupTransportBase.py", line 74, in run
    rval = self._run_subtransport(incomingHandler, max_runtime)
  File "/Users/daniel/Projects/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/wakeupTransportBase.py", line 80, in _run_subtransport
    rval = self._runWithExpiry(incomingHandler)
  File "/Users/daniel/Projects/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/TCPTransport.py", line 1143, in _runWithExpiry
    rrecv, rsend, rerr = select.select(wrecv, wsend,
ValueError: file descriptor cannot be a negative integer (-1)

In the error log we also see:

2020-04-17 10:51:26.016014 p24308 ERR  ValueError on select(#0: [], #3: [-1, 14, 13], #3: {13, -1, 14}, 0.166675)

I saw that the code in TCPTransport has handling for invalid file descriptors but it expects OSError to be raised.

I can reproduce this in our application when an actor sends large log messages so I wonder whether it is related. I also managed to capture a complete debug log output of Thespian (~ 13MB expanded; the error is towards the end of the log file).

The error is pretty well reproducible but the reproduction scenario is a bit involved (it requires running an application that is somewhat complex to setup) so I'm not sure whether it is helpful to describe it here. I am of course happy to replicate the steps here, provide further details and / or participate in testing but I wanted to get the issue out here first.

Thank you for the report and the debug log to help with looking at this. Having looked through the log, it looks like the remote Actor closed the socket at some point while there were still message queue to send to that remote queued, and the send got attempted. It's entirely possible for sockets to close, and the TCPTransport should accomodate those situations, but as you noted, it wasn't properly handling the ValueError scenario where the socket fd was invalid.

I've uploaded a fix (a99fbf9) that should properly handle the ValueError scenario. It's a little tough to reproduce this with local tests, but it sounds like it's not too difficult for you to reproduce, so I'm hoping you can confirm that this helps.

It's entirely possible that there's some other underlying issue that resulted in the socket disconnect; please check that the updated version is giving you the expected results and let me know if there's something else I can help with here.

[Note: apparently pushing a commit with the word "Fix" followed by a hash and a bug number causes github to automatically close an issue. TIL. I've re-opened the issue to wait for your confirmation and further testing.]

I ran tests for a full day, both with with version 3.10.0 and your commit a99fbf9 and can confirm that I never spotted the issue again with a99fbf9. Thanks Kevin for implementing the fix so quickly!