python-trio/pytest-trio

Reduce danger of accidental exception suppression

Nikratio opened this issue · 5 comments

Consider the following code. It consists of an echo server with a bug, and a unit test for the server:

#!/usr/bin/env python3

import pytest
import pytest_trio
import trio
import logging

log = logging.getLogger(__name__)
PORT = 12346
BUFSIZE = 16384
async def echo_server(server_sock):
    try:
        with server_sock:
            await server_sock.recv(BUFSIZE)
            raise RuntimeError('Damn!')
    except Exception as exc:
        await trio.sleep(0.1) # make problem deterministic
        log.exception('Things went wrong...')

async def echo_listener(nursery):
    with trio.socket.socket() as listen_sock:
        await listen_sock.bind(("127.0.0.1", PORT))
        listen_sock.listen()
        while True:
            server_sock, _ = await listen_sock.accept()
            nursery.start_soon(echo_server, server_sock)

async def parent():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(echo_listener, nursery)

@pytest.mark.trio
async def test_server(nursery):
    nursery.start_soon(parent)
    await trio.sleep(0.5) # wait for server to listen
    with trio.socket.socket() as client_sock:
        await client_sock.connect(("127.0.0.1", PORT))
        await client_sock.send(b'hello, world')
        resp = await client_sock.recv(BUFSIZE)
        assert resp == b'hello, world'

As expected, the unit test fails - but the root cause of the failure is nowhere to be seen:

$ python3 -m pytest tests/lograce.py 
==================================== test session starts =====================================
platform linux -- Python 3.6.4, pytest-3.3.2, py-1.5.2, pluggy-0.6.0 -- /home/nikratio/consulting/iofabric/venv/bin/python3
cachedir: tests/.cache
rootdir: /home/nikratio/consulting/iofabric/gruagach/tests, inifile: pytest.ini
plugins: trio-0.3.0
collected 1 item                                                                             

tests/lograce.py::test_server FAILED                                                   [100%]

========================================== FAILURES ==========================================
________________________________________ test_server _________________________________________
Traceback (most recent call last):
  File "/home/nikratio/consulting/iofabric/gruagach/tests/lograce.py", line 41, in test_server
    assert resp == b'hello, world'
AssertionError: assert b'' == b'hello, world'
  Right contains more items, first extra item: 104
  Full diff:
  - b''
  + b'hello, world'
================================== 1 failed in 0.63 seconds ==================================

If we force the test to wait a moment before failing...

@pytest.mark.trio
async def test_server(nursery):
    nursery.start_soon(parent)
    await trio.sleep(0.5)  # wait for server to listen
    try:
        with trio.socket.socket() as client_sock:
            await client_sock.connect(("127.0.0.1", PORT))
            await client_sock.send(b'hello, world')
            resp = await client_sock.recv(BUFSIZE)
            assert resp == b'hello, world'
    finally:
        await trio.sleep(0.5)

...we get much better results:

$ python3 -m pytest tests/lograce.py 
[...]
collected 1 item                                                                             

tests/lograce.py::test_server FAILED                                                   [100%]

========================================== FAILURES ==========================================
________________________________________ test_server _________________________________________
Traceback (most recent call last):
  File "/home/nikratio/consulting/iofabric/gruagach/tests/lograce.py", line 42, in test_server
    assert resp == b'hello, world'
AssertionError: assert b'' == b'hello, world'
  Right contains more items, first extra item: 104
  Full diff:
  - b''
  + b'hello, world'
------------------------------------- Captured log call --------------------------------------
21:55:13.383 ERROR    lograce.echo_server: Things went wrong...
Traceback (most recent call last):
  File "/home/nikratio/consulting/iofabric/gruagach/tests/lograce.py", line 15, in echo_server
    raise RuntimeError('Damn!')
RuntimeError: Damn!
================================== 1 failed in 1.13 seconds ==================================

I believe the problem is that the failing test causes trio to cancel the server task before it has a chance to log its exception. Note that the call to trio.sleep in echo_server is not necessary for the problem to occur; it just makes it much more likely.

I am not sure how best to fix this. Maybe the trio test runner could, if the test fails, give the other nursery tasks a short grace period in which they have a chance to report any errors before cancelling them?

So first, as an FYI, your example code can be simplified a lot by using Trio's high-level networking interface – it's kinda new and not documented as well as it could be yet, so I'm not surprised you missed it, but this is definitely the recommended way to go:

from functools import partial  # needed for the nursery.start call below

async def echo_server(server_stream):
    try:
        # Normally, this 'with' isn't really necessary, because serve_tcp will automatically
        # close server_stream when the echo_server function returns. But it is needed to
        # trigger the problem that this issue is about :-).
        async with server_stream:
            await server_stream.receive_some(BUFSIZE)
            raise RuntimeError('Damn!')
    except Exception as exc:
        await trio.sleep(0.1) # make problem deterministic
        log.exception('Things went wrong...')

@pytest.mark.trio
async def test_server(nursery):
    await nursery.start(partial(trio.serve_tcp, echo_server, PORT, host="127.0.0.1"))
    # nursery.start and trio.serve_tcp coordinate to make sure that we don't continue
    # until the server is ready to accept connections, so no need to sleep here
    async with await trio.open_tcp_stream("127.0.0.1", PORT) as client_stream:
        await client_stream.send_all(b'hello, world')
        resp = await client_stream.receive_some(BUFSIZE)
        assert resp == b'hello, world'

This also makes it easy to handle IPv6 and SSL by just changing the serve_tcp call. You could also let the OS automatically pick a free port by making it:

listeners = await nursery.start(partial(trio.serve_tcp, echo_server, 0, host="127.0.0.1"))
async with await trio.testing.open_stream_to_socket_listener(listeners[0]) as client_stream:
    ...

I think I'm also required by the ancient laws governing socket programmers to point out that this test is broken in any case, because you can't assume that the call to receive_some will get all of the data back – even if the echo server was actually working, it could receive b"he" first, and then b"llo world" later, or something like that. I know it's not the point here, but it's good to get in the habit of handling these things correctly, because it's the sort of thing that will seem to work in simple tests over localhost but then fail randomly in real life.
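
For illustration, here's a rough sketch of what handling that correctly could look like (this helper isn't from the issue, just one way to do it, assuming you know how many bytes you expect back):

async def receive_exactly(stream, size):
    # receive_some may return any non-empty prefix of the data, so keep
    # calling it until the whole message has arrived or the peer hangs up.
    chunks = []
    received = 0
    while received < size:
        chunk = await stream.receive_some(size - received)
        if not chunk:  # b"" means the other side closed the connection
            raise EOFError("connection closed after %d bytes" % received)
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks)

The test would then use resp = await receive_exactly(client_stream, len(b'hello, world')) instead of a single receive_some call.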

Anyway, none of that has to do with your actual problem, I just wanted to point that out :-).

So regarding your actual issue, what's happening here is:

  1. The client sends b"hello world"
  2. The server receives b"hello world", but then instead of echoing it back, it raises an exception.
  3. The exception causes us to exit the with server_sock block, which closes the server's socket.
  4. The server's except block then catches the exception, and blocks in the call to trio.sleep.
  5. The server's socket being closed causes the client's call to recv to return an empty string b"", which just means that the server has hung up.
  6. The client compares this to the string it was expecting, b"hello world", and since they don't match it raises an exception.
  7. The test machinery gets this exception and cancels everything in the nursery fixture, including in particular our echo_server task.
  8. The server's trio.sleep call raises Cancelled, so the log.exception line is never executed. The Cancelled exception does at least have the RuntimeError as its __context__...
  9. ...but then the Cancelled exception is absorbed by nursery fixture, instead of being printed, so the __context__ is lost.

This is tricky!
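
To make steps 7-9 concrete, here's a minimal standalone sketch of the same race (not the original test, just an illustration): the RuntimeError survives only as the __context__ of a Cancelled exception, and that Cancelled is absorbed by the nursery, so the only thing reported is the ValueError.

import trio

async def crasher():
    try:
        raise RuntimeError('Damn!')
    except Exception:
        await trio.sleep(1)  # checkpoint inside the except block
        print('never reached: cancellation arrives during the sleep')

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(crasher)
        await trio.sleep(0.1)
        # Plays the role of the failing test: raising here cancels the
        # nursery's children, including crasher while it's asleep.
        raise ValueError('test failure')

trio.run(main)  # only the ValueError is reported; the RuntimeError is lost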

The key features of the server that make this happen seem to be: it raises an error, the error handling ends up sending a message to the client (in this case by closing the socket), and then the error handling tries to report something, but in between those two steps there is a checkpoint where it can be cancelled.

So in your original example, the call to trio.sleep in the server actually is crucial, because without that there would be no checkpoint – the socket would be closed and then we'd immediately log the error, without giving the client a chance to run or checking for cancellation. But this isn't a general solution; in my revised version, you can get the issue without the sleep, because the async with introduces a checkpoint at the critical place.

Though you could fix that case too, by moving the async with outside of the try/except. But that's not a general solution either; assuming the server is running some arbitrary code, it may well have some kind of cleanup handlers inside that execute checkpoints, and that's all we need here for the original exception to get lost.
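
Roughly like this (just a sketch of that tweak, not a general fix):

async def echo_server(server_stream):
    # Closing the stream only after the except block has run means there is
    # no checkpoint between raising the error and logging it (assuming the
    # deliberate sleep from the example is dropped).
    async with server_stream:
        try:
            await server_stream.receive_some(BUFSIZE)
            raise RuntimeError('Damn!')
        except Exception:
            log.exception('Things went wrong...')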

Really this problem is just inherent in how exceptions work. If you block inside an except block, then that needs to be cancellable, and when it's cancelled you'll probably lose the old exception, which is unfortunate if you were planning to re-raise it (like the __aexit__ handler would in async with stream), or do something else like log it (as in the original example). But we can't just not cancel a task that's blocked, and we can't tell what the task would have done with the original exception if we hadn't cancelled it. If we were in a language like Rust where error propagation happens explicitly then this wouldn't be as much of an issue. But we can't require that cancellations be propagated explicitly; without assistance from the language no-one would do it and we'd just have tons of programs that couldn't shut down.

Maybe what we really need here is some kind of better tracking of exception __context__? I'm going to open an issue in trio to consider that further.

I agree that the only plausible workaround at the pytest-trio level is to make the test-running code look something like:

try:
    await actual_test_fn(**kwargs)
except Exception:
    await trio.sleep(0.1)
    raise

I am... really unsure about whether this is a good idea. Running tests fast is really important, and adding a sleep to every failure will slow down everyone, even though I suspect that this particular case is quite rare. Though that's another thing that makes me nervous: we have no idea how often this happens or any way to figure out whether this hack is actually helping. Is 0.1 seconds even long enough? Is it way too long? How do we know?

Yeah, I know it can be simplified and improved. The goal was to quickly get a minimal example that reproduces the issue.

I agree, the core of the problem is actually not related to pytest at all but a result of the way that exceptions and nurseries work. If you have a nursery that starts two tasks, and the interaction of these tasks results in an exception being raised in each task, then only one exception is guaranteed to make it out of the nursery. The second one has to race against the Cancelled exception that's raised by the nursery.

That said, why wouldn't it be feasible to introduce an optional grace period for task cancellation - at least in the context of testing? In other words, if one task terminates with an exception, wait up to TIMEOUT seconds for the other tasks to raise exceptions too, and only afterwards send the cancellation request. There is no good value for TIMEOUT, so it should default to zero, but this would at least allow adjusting the timeout if one gets test failures that don't make sense. I believe that if this isn't implemented in trio proper, it would require some rather nasty monkeypatching to achieve, so having it upstream would be nice.

It could be done, but I'd really rather avoid doing that if at all possible. Almost no-one's going to know that this mode exists, and for those who do, the only way to realize you should turn it on is if you've already figured out what's happening :-).

See python-trio/trio#416 for another idea.

I ran into this problem in my own project.

The solution I came up with is to wrap the socket used in the tests so that it catches the BrokenStreamError and replaces it with a custom exception.

https://github.com/Scille/parsec-cloud/blob/4174ded193c563513c8d7b6aa745112d48986a30/tests/common.py#L45-L63

Then the code responsible for starting the server and connecting to it is packaged in an async context manager that catches the custom exception and waits forever instead of re-raising it.

https://github.com/Scille/parsec-cloud/blob/4174ded193c563513c8d7b6aa745112d48986a30/tests/common.py#L92-L115

This way we can be sure that the exception coming from the server is the one that bubbles up to the user.
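
For readers who don't want to follow the links, the approach looks roughly like this (a sketch with illustrative names, not the actual parsec-cloud code; note that trio has since renamed BrokenStreamError to BrokenResourceError):

from contextlib import asynccontextmanager

import trio

class ClientStreamBroken(Exception):
    """Raised by the test client instead of trio's broken-stream error."""

class WrappedStream:
    # Wraps the client stream so that a broken connection surfaces as our
    # own exception type rather than trio's.
    def __init__(self, stream):
        self._stream = stream

    async def send_all(self, data):
        try:
            await self._stream.send_all(data)
        except trio.BrokenResourceError as exc:
            raise ClientStreamBroken() from exc

    async def receive_some(self, max_bytes):
        try:
            return await self._stream.receive_some(max_bytes)
        except trio.BrokenResourceError as exc:
            raise ClientStreamBroken() from exc

@asynccontextmanager
async def connected_client(port):
    stream = await trio.open_tcp_stream("127.0.0.1", port)
    async with stream:
        try:
            yield WrappedStream(stream)
        except ClientStreamBroken:
            # The server hung up on us, which means it is probably about to
            # raise something more interesting; park here so that its
            # exception wins the race out of the nursery instead of ours.
            await trio.sleep_forever()

A test then does async with connected_client(PORT) as client_stream: and a broken connection on the client side turns into an indefinite wait, which is eventually cancelled once the server's exception tears down the nursery.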

I think this is trio's issue to fix, so closing the pytest-trio version of this bug in favor of python-trio/trio#416 and python-trio/trio#611