minrk/wurlitzer

Wurlitzer hangs process when pipe/buffer is full

Closed this issue · 5 comments

gidim commented

Hi,

We're using Wurlitzer at Comet.ml, and we found that when the C library prints a large string,
the entire process hangs. We managed to create a test that reproduces this. Would it be possible to increase the buffer size or find a workaround?

from wurlitzer import pipes, PIPE

def test_buffer_full():
    # printf_err() writes msg + '\n' to the C-level stderr via libc.fprintf
    # (helper defined earlier in test.py)
    with pipes(stdout=PIPE, stderr=PIPE) as (stdout, stderr):
        long_string = "x" * 1000000  # create a very long string
        printf_err(long_string)

    # The test never reaches this point because the process hangs.
    assert stderr.read() == long_string + "\n"
minrk commented

Thanks for the test!

I think the OS controls the buffer size for pipes, so we don't have many options there. My first guess for making the test pass was to switch stderr=PIPE to stderr=io.StringIO(), which should be able to hold messages of arbitrary size.
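
Roughly, a minimal sketch of that switch (reusing your printf_err helper; pipes() yields back the object you pass in):

import io
from wurlitzer import pipes, PIPE

with pipes(stdout=PIPE, stderr=io.StringIO()) as (stdout, stderr):
    printf_err("x" * 1000000)

# the StringIO grows in memory as needed, so arbitrary sizes fit
result = stderr.getvalue()

(Note the C-level output still flows through an OS pipe internally; the StringIO only replaces the Python-side destination, which is presumably why the hang persisted.)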

I ran the test again and it still hung. It turns out it's hanging in the call to fflush. I reworked how fflush is called in #16 and your test now passes. Does that work for you?

gidim commented

Thanks for the quick reply!
Unfortunately it's still stuck on my machine. I managed to identify that the hang starts when the string is greater than or equal to 65535 bytes long. This is very likely the default pipe buffer size on OS X 10.13.3.
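
(For reference, the pipe capacity can be measured directly; a minimal standalone sketch, independent of wurlitzer:)

import os

r, w = os.pipe()
os.set_blocking(w, False)  # make the write raise instead of hanging
written = 0
try:
    while True:
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    pass
print("pipe buffer capacity:", written)  # typically 65536 (64kB)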

When I removed the fflush calls completely, it hung on os.read(pipe, 1024). Not sure why reading from a full buffer should block. Possibly a race condition?

gidim commented

EDIT:
Just want to be clear that this test fails (hangs) on OS X with Python 3.6.3 and 2.7.14.

Managed to dig a little bit deeper. When I dump the stack traces of all threads, it's clear that the process is blocked on select():

  File "test.py", line 142, in <module>
    test_buffer_full()
  File "test.py", line 122, in test_buffer_full
    printf_err(long_string)
  File "test.py", line 25, in printf_err
    libc.fprintf(c_stderr_p, (msg + '\n').encode('utf8'))
Thread 0x0000700005250000 (most recent call first):
  File "/Users/Gideon/Documents/dev/wurlitzer/wurlitzer.py", line 161 in forwarder
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 864 in run
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007fff9efa9340 (most recent call first):
  File "test.py", line 135 in sigshow
  File "test.py", line 25 in printf_err
  File "test.py", line 122 in test_buffer_full
  File "test.py", line 142 in <module>

From my understanding, select() should never block indefinitely when provided with a timeout value. It's possible (although unlikely) that passing 0 as the timeout is interpreted as NULL on the C side. When I set the timeout to a small non-zero value, the test passes:

r, w, x = select.select(pipes, [], [], 0.01)
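
For what it's worth, the timeout semantics are easy to check in isolation (a standalone sketch, nothing wurlitzer-specific):

import os
import select
import time

r, w = os.pipe()  # nothing is ever written to this pipe
start = time.monotonic()
ready, _, _ = select.select([r], [], [], 0.01)
print(ready, time.monotonic() - start)  # [] after ~0.01s; it returns rather than blocking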

minrk commented

Interesting. The test passed for me, but when I ran it more times, I could see that it didn't pass reliably.

The hang does always seem to be in fflush for me, though, not select. I can see now that there's a race condition there: if the too-large write happens in between select and fflush, the fflush will block forever.

In general, now that I know fflush blocks when the buffer is full, I think that calling fflush from the same thread that's reading from the pipe is a fundamentally unsafe pattern. My first idea for this is to spawn yet another thread that does nothing but call fflush (see the sketch after the list below). That seems a little wasteful, but I don't see another way to solve the dueling problems of:

  1. sometimes select won't wake unless fflush has been called
  2. fflush will block if the buffer is full
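
A rough sketch of that idea (hypothetical names, not wurlitzer's actual implementation): a daemon thread does nothing but call fflush(NULL); if the pipe fills up, only this thread blocks, and it unblocks as soon as the reader thread drains the pipe.

import threading
import time
from ctypes import CDLL

libc = CDLL(None)
stop = threading.Event()

def flush_forever(interval=0.1):
    # fflush(NULL) flushes every open C output stream
    while not stop.is_set():
        libc.fflush(None)
        time.sleep(interval)

flusher = threading.Thread(target=flush_forever, daemon=True)
flusher.start()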
gidim commented

I haven't seen problem 1 happen before, so I don't have much to contribute to the discussion. Have you considered using a file instead of a pipe?
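
For example, a rough sketch of the file-based approach (my own illustration, not wurlitzer's implementation): redirect fd 2 into a temporary file, which never fills up the way a 64kB pipe buffer does.

import ctypes
import os
import tempfile

libc = ctypes.CDLL(None)

tf = tempfile.TemporaryFile(mode="w+b")
saved_fd = os.dup(2)                      # keep the real stderr
os.dup2(tf.fileno(), 2)                   # point fd 2 at the file
try:
    libc.perror(b"C-level stderr write")  # C stdio output now lands in the file
finally:
    libc.fflush(None)
    os.dup2(saved_fd, 2)                  # restore stderr
    os.close(saved_fd)
tf.seek(0)
print(tf.read())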