encode/httpcore

Race Condition After Async Cancellations Breaks Connection Pool

Opened this issue · 2 comments

  • Initially raised as discussion #952

This bug occurs after certain async cancellations, and causes the HTTP pool to get stuck in a degraded state, unable to issue requests. It looks like a resurfacing of this issue.

httpcore version: 1.0.5
python version: 3.12.4

Code to reproduce:

import random
import asyncio
import httpcore
from collections import defaultdict

# Global dictionary to store status code counts
status_counts = defaultdict(int)


async def make_request(pool):
    try:
        response = await pool.request(
            method="GET",
            url=("https://localhost:8000/"),
        )
        status_counts[response.status] += 1
    except BaseException:
        status_counts[0] += 1


async def spawn_and_cancel_tasks(pool):
    while True:
        tasks = [asyncio.create_task(make_request(pool)) for _ in range(100)]
        await asyncio.sleep(random.random())  # Allow some time for the tasks to start
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def monitor_stats(pool):
    while True:
        status_str = ", ".join(f"{code}: {count}" for code, count in sorted(status_counts.items()))
        print(f"Status codes: {status_str} | Pool {pool}")
        status_counts.clear()
        await asyncio.sleep(1)


async def main():

    async with httpcore.AsyncConnectionPool() as pool:
        stats_task = asyncio.create_task(monitor_stats(pool))
        spawn_task = asyncio.create_task(spawn_and_cancel_tasks(pool))

        try:
            await asyncio.gather(stats_task, spawn_task)
        except KeyboardInterrupt:
            print("Stopping...")

if __name__ == "__main__":
    asyncio.run(main())

Output:

Status codes:  | Pool <AsyncConnectionPool [Requests: 0 active, 0 queued | Connections: 0 active, 0 idle]>
Status codes: 200: 200 | Pool <AsyncConnectionPool [Requests: 0 active, 0 queued | Connections: 0 active, 10 idle]>
Status codes: 0: 202, 200: 139 | Pool <AsyncConnectionPool [Requests: 62 active, 0 queued | Connections: 8 active, 2 idle]>
Status codes: 0: 142, 200: 117 | Pool <AsyncConnectionPool [Requests: 5 active, 0 queued | Connections: 5 active, 0 idle]>
Status codes: 0: 35, 200: 165 | Pool <AsyncConnectionPool [Requests: 5 active, 0 queued | Connections: 5 active, 5 idle]>
Status codes: 200: 100 | Pool <AsyncConnectionPool [Requests: 5 active, 0 queued | Connections: 5 active, 5 idle]>
Status codes: 0: 155, 200: 45 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 300 | Pool <AsyncConnectionPool [Requests: 5 active, 0 queued | Connections: 10 active, 0 idle]>
Status codes:  | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 200 | Pool <AsyncConnectionPool [Requests: 5 active, 0 queued | Connections: 10 active, 0 idle]> <- stops responding here
Status codes: 0: 100 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 200 | Pool <AsyncConnectionPool [Requests: 5 active, 0 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 100 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 100 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 200 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 200 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>
Status codes: 0: 300 | Pool <AsyncConnectionPool [Requests: 5 active, 100 queued | Connections: 10 active, 0 idle]>

You can see from the above output the point where the pool stops issuing new requests. This is reproducible using the default python http server:

echo 'a' > index.html
python -m http.server

The bug applies to both asyncio and trio. However in trio it's much narrower. With asyncio it triggers almost immediately, while trio takes about 30 seconds.

Code to reproduce in trio:

import random
import trio
import httpcore
from collections import defaultdict

# Global dictionary to store status code counts
status_counts = defaultdict(int)


async def make_request(pool):
    try:
        response = await pool.request(
            method="GET",
            url="http://localhost:8000/",
        )
        status_counts[response.status] += 1
    except BaseException:
        status_counts[0] += 1


async def spawn_and_cancel_tasks(pool):
    while True:
        with trio.CancelScope() as cancel_scope:
            async with trio.open_nursery() as nursery:
                for _ in range(100):
                    nursery.start_soon(make_request, pool)
                await trio.sleep(random.random())  # Allow some time for the tasks to start
                cancel_scope.cancel()


async def monitor_stats(pool):
    while True:
        status_str = ", ".join(f"{code}: {count}" for code, count in sorted(status_counts.items(), reverse=True))
        print(f"Status codes: {status_str} | Pool {pool}")
        # Clear the counts for the next iteration
        status_counts.clear()
        await trio.sleep(1)


async def main():

    async with httpcore.AsyncConnectionPool() as pool:
        async with trio.open_nursery() as nursery:
            nursery.start_soon(monitor_stats, pool)
            await spawn_and_cancel_tasks(pool)


if __name__ == "__main__":
    try:
        trio.run(main)
    except KeyboardInterrupt:
        print("Stopping...")

Sorry to bring this up again; I know these types of issues suck to troubleshoot. But it's been causing issues for us in production. Please let me know if there's anything else I can do to help.

also seeing this on httpcore 1.0.7

@TheTechromancer @nikcaryo-super

There is a PR that must fix the issue: #986. Any feedback whether it solves the problem is welcome.