neo4j/neo4j-python-driver

Explicit AsyncTransaction randomly ignores specified timeout

v-v-d opened this issue · 3 comments

Bug Report

First of all, I want to thank you for the amazing Neo4j driver for Python with full asyncio support. It's really cool and performs wonderfully!

Now to the point. While using the driver, I've noticed that when explicit transactions are used and a timeout argument is passed at their creation, the driver does not always respect that timeout. This leads to transactions whose queries run for minutes even though a timeout of just a few seconds was specified.

I have prepared a simple pytest test that confirms the above; you can run it via docker-compose up --build.
neo4j_test.zip

Could you please tell me if this is a bug or expected behavior? Or maybe I'm doing something wrong?

This is a snippet of the test:

from typing import LiteralString, cast

import neo4j
import pytest
from _pytest.fixtures import SubRequest
from neo4j import AsyncGraphDatabase


@pytest.fixture()
async def driver() -> neo4j.AsyncDriver:
    # The async context manager already closes the driver on exit,
    # so no explicit await driver.close() is needed after the yield.
    async with AsyncGraphDatabase.driver(
        uri="neo4j://neo4j:7687",
        auth=("neo4j", "top-secret"),
    ) as driver:
        yield driver


@pytest.fixture()
async def session(driver: neo4j.AsyncDriver) -> neo4j.AsyncSession:
    session = driver.session(database="neo4j")
    yield session
    await session.close()


@pytest.fixture()
async def trx(request: SubRequest, session: neo4j.AsyncSession) -> neo4j.AsyncTransaction:
    trx = await session.begin_transaction(timeout=request.param)
    yield trx
    await trx.close()


@pytest.mark.parametrize("trx", [0.1, 0.3, 0.6, 0.9], indirect=True)
async def test_trx_timeout(trx: neo4j.AsyncTransaction) -> None:
    with pytest.raises(neo4j.exceptions.ClientError):
        await trx.run(query=cast(LiteralString, "CALL apoc.util.sleep(1000)"))
        await trx.commit()

and this is one of the outputs:

============================= test session starts ==============================
platform linux -- Python 3.11.6, pytest-8.1.1, pluggy-1.4.0
rootdir: /app/code
plugins: asyncio-0.23.6
asyncio: mode=Mode.AUTO
collected 4 items

app/test_neo4j.py .F.F                                                   [100%]

=================================== FAILURES ===================================
____________________________ test_trx_timeout[0.3] _____________________________

trx = <neo4j._async.work.transaction.AsyncTransaction object at 0xffff860d4a50>

    @pytest.mark.parametrize("trx", [0.1, 0.3, 0.6, 0.9], indirect=True)
    async def test_trx_timeout(trx: neo4j.AsyncTransaction) -> None:
>       with pytest.raises(neo4j.exceptions.ClientError):
E       Failed: DID NOT RAISE <class 'neo4j.exceptions.ClientError'>

app/test_neo4j.py:35: Failed
____________________________ test_trx_timeout[0.9] _____________________________

trx = <neo4j._async.work.transaction.AsyncTransaction object at 0xffff860e9c50>

    @pytest.mark.parametrize("trx", [0.1, 0.3, 0.6, 0.9], indirect=True)
    async def test_trx_timeout(trx: neo4j.AsyncTransaction) -> None:
>       with pytest.raises(neo4j.exceptions.ClientError):
E       Failed: DID NOT RAISE <class 'neo4j.exceptions.ClientError'>

app/test_neo4j.py:35: Failed
=========================== short test summary info ============================
FAILED app/test_neo4j.py::test_trx_timeout[0.3] - Failed: DID NOT RAISE <clas...
FAILED app/test_neo4j.py::test_trx_timeout[0.9] - Failed: DID NOT RAISE <clas...
========================= 2 failed, 2 passed in 3.63s ==========================

My Environment

Python Version: 3.11 (python:3.11-slim Docker image)
Driver Version: 5.18.0 (Neo4j Bolt driver for Python)
Server Version and Edition: 5.18.0 Community (from the neo4j:5.18.0-community Docker image)
Operating System: Debian Bookworm (from the python:3.11-slim Docker image)

Hi and thanks for reaching out with this very detailed report 😄

This is, while maybe surprising, expected behavior. If you were to enable debug logging, you'd see that the driver sends the timeout to the server exactly as you specified it. It's then the server's responsibility to enforce the transaction timeout; the driver doesn't process it any further. That being said, I remember being surprised by this behavior myself at some point and asked our DBMS kernel developers what's going on. (Implementation detail follows, i.e., this could change in the future.) There is a watcher thread in the DBMS that checks every now and then™ (I think currently every 2s) whether there are any transactions that have exceeded their configured timeout. If so, they are terminated.
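For reference, here's a minimal sketch of how that debug logging can be enabled with the driver's neo4j.debug.watch helper (the URI and credentials are the placeholders from the test above):

import asyncio
import sys

from neo4j import AsyncGraphDatabase
from neo4j.debug import watch

# Stream the driver's debug log to stdout. Among other things, you should
# see the configured timeout being sent to the server with the BEGIN message.
watch("neo4j", out=sys.stdout)


async def main() -> None:
    async with AsyncGraphDatabase.driver(
        "neo4j://neo4j:7687", auth=("neo4j", "top-secret")
    ) as driver:
        async with driver.session(database="neo4j") as session:
            tx = await session.begin_transaction(timeout=0.5)
            await tx.run("RETURN 1")
            await tx.commit()


if __name__ == "__main__":
    asyncio.run(main())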

TL;DR: the timeout cannot be arbitrarily small and the server might cut transactions some slack before terminating them.

I'll close the issue as there is nothing to do here. But I'll forward this issue to team kernel so they know that at least 2 people were surprised by this behavior. Let's see if that leads to something. If you have further questions, please feel free to keep commenting.

Hi!
Got it, thank you very much for the detailed response!
What do you think: could using the asyncio.timeout context manager be dangerous here? Is it possible that a transaction is interrupted on the client side by asyncio.TimeoutError but remains open on the Neo4j side, thus generating anomalous load on Neo4j?

Several thoughts:

  • If you're cancelling driver tasks (which the timeout would potentially do), you force the driver to hard-close the socket should the task in question have one borrowed from the pool. That's because there's no room for awaiting any asynchronous calls, hence the connection's state cannot be cleaned up and the connection cannot be gracefully put back into the pool. So that might result in a small performance hit.
  • Please double check that the new (Python 3.11) timeout implementation does not move the work into a new Task (I think it doesn't, but I'm not 100% sure). If it does, make sure you don't accidentally end up using concurrency-unsafe constructs (e.g., sessions, transactions, results) concurrently. See also the corresponding note in the API docs.
  • You could easily end up in a situation where your client cannot know whether the transaction completed or not. If you cancel after an explicit or implicit commit instruction has been sent over the wire, the driver won't wait for the commit result. But maybe that's fine for you; it really depends on the use case.
  • The server should kill transactions when the socket they were started from gets closed, so you shouldn't see lingering transactions. But if you decide to go with this design, better verify that yourself :) A minimal sketch of the pattern follows below.
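To illustrate the pattern being discussed, here's a minimal sketch of a client-side timeout with asyncio.timeout (assuming Python 3.11+ and the placeholder URI/credentials from the test above); treat it as a starting point under the caveats listed, not a recommended design:

import asyncio

from neo4j import AsyncGraphDatabase


async def main() -> None:
    async with AsyncGraphDatabase.driver(
        "neo4j://neo4j:7687", auth=("neo4j", "top-secret")
    ) as driver:
        async with driver.session(database="neo4j") as session:
            try:
                # Client-side timeout: cancels the running driver task after
                # 0.5s. On cancellation the driver may have to hard-close the
                # borrowed socket (see the first bullet above), and the server
                # should then kill the transaction tied to that socket.
                async with asyncio.timeout(0.5):
                    tx = await session.begin_transaction()
                    await tx.run("CALL apoc.util.sleep(10000)")
                    await tx.commit()
            except TimeoutError:  # asyncio.TimeoutError is an alias since 3.11
                # Careful: if cancellation hit after the commit instruction was
                # sent over the wire, you cannot know whether it applied.
                print("client-side timeout hit")


if __name__ == "__main__":
    asyncio.run(main())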