Global concurrency makes a script more than twice as slow as a local semaphore
Bug summary
Version 1, with a local semaphore:
import asyncio

from prefect import task, flow

sem = asyncio.Semaphore(100)


@task
async def print_value(value):
    async with sem:
        await asyncio.sleep(10)
        print(value)


@flow
async def async_flow():
    print("Hello, I'm an async flow")
    # runs concurrently
    coros = [print_value(i) for i in range(1000)]
    await asyncio.gather(*coros)


if __name__ == "__main__":
    asyncio.run(async_flow())
This takes 1m 42s
And here is the version using a global concurrency limit named "test" with 100 slots:
import asyncio

from prefect import task, flow
from prefect.concurrency.asyncio import concurrency


@task
async def print_value(value):
    async with concurrency("test"):
        await asyncio.sleep(10)
        print(value)


@flow
async def async_flow():
    print("Hello, I'm an async flow")
    # runs concurrently
    coros = [print_value(i) for i in range(1000)]
    await asyncio.gather(*coros)


if __name__ == "__main__":
    asyncio.run(async_flow())
It takes 4m 6s
This is more than twice as slow.
Version info
Version: 3.1.0
API version: 0.8.4
Python version: 3.12.4
Git commit: a83ba39b
Built: Thu, Oct 31, 2024 12:43 PM
OS/Arch: darwin/arm64
Profile: local
Server type: cloud
Pydantic version: 2.8.2
Integrations:
prefect-gcp: 0.6.1
What region of the world, roughly, are you located in? (Trying to figure out if it's just a network latency thing.)
This is an interesting point. The numbers above were from my local machine, which is in Israel. So I tried running the flow from a K8s cluster in the US, and the time is still not great: 3m 56s. In the K8s cluster it was about 10s faster, but still very slow.
Hi @david-gang, is there a reason you expect a distributed lock that works across machines and environments to be roughly the same speed as a lock that only works within a single thread on a single machine? We can look into motivations for speeding this up for sure, but comparing it to asyncio.Semaphore isn't an apples-to-apples comparison.
Hi @cicdw, I understand that I cannot compare a local lock with a distributed lock. I am looking at it from the perspective of how much time I lose when I use global concurrency, and I need some baseline, even if it is imperfect. What do you think would be a fair baseline for comparison? What is the ideal distributed lock that exists only in theory in CS textbooks, and what is its theoretical overhead?
According to: https://docs.prefect.io/v3/develop/global-concurrency-limits#manage-database-connections
"Manage the maximum number of concurrent database connections to avoid exhausting database resources."
But if every query takes 10 seconds and another 10 seconds get wasted due to the lock, this is something that needs to be mentioned. So first and foremost, the question is how much time I am going to pay if I use this feature. I would expect to pay 10-30% more time, but more than twice the time looks like too much to me.
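To make the comparison concrete, here is a back-of-envelope calculation using the numbers from this thread (1000 tasks, 100 slots, 10 s of work per task); it amortizes the extra time beyond the theoretical floor over all tasks:

```python
tasks = 1000
slots = 100
work_s = 10

# With 100 slots, the 1000 tasks run in 10 sequential waves of 100,
# so no scheduler can finish faster than:
floor_s = (tasks / slots) * work_s  # 100 s, i.e. 1m40s

local_s = 102   # 1m42s measured with asyncio.Semaphore
global_s = 246  # 4m6s measured with the global concurrency limit

# Extra time beyond the theoretical floor, amortized per task:
local_overhead_s = (local_s - floor_s) / tasks    # 0.002 s per task
global_overhead_s = (global_s - floor_s) / tasks  # 0.146 s per task

print(f"floor={floor_s:.0f}s  local=+{local_overhead_s * 1000:.0f}ms/task  "
      f"global=+{global_overhead_s * 1000:.0f}ms/task")
```

So the local semaphore run is essentially at the theoretical floor, while the global limit adds roughly 146 ms per task on top of it.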
Additionally, I think there is room to improve the concurrency implementation. The current code acquires and releases the lock for every item in isolation. Maybe it would be better to check whether another item is waiting and hold the lock longer, with starvation handled in a more sophisticated manner.
Requests to acquire a slot could also be windowed, so that we request x slots together for items submitted within the same second (time window).
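The windowing idea could be sketched roughly as follows. This is a toy illustration, not Prefect's actual client: `batching_acquirer` and `fake_acquire_batch` are hypothetical names, and `acquire_batch` stands in for an assumed server call that grants several slots in one round trip.

```python
import asyncio


async def batching_acquirer(queue, acquire_batch, window_s=0.05):
    # Hypothetical loop: collect slot requests that arrive within the same
    # time window and grant them all with a single server round trip.
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]
        deadline = loop.time() + window_s
        while (timeout := deadline - loop.time()) > 0:
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        await acquire_batch(len(batch))  # one round trip for the whole batch
        for fut in batch:
            fut.set_result(None)  # wake every waiter in the batch


async def demo():
    calls = []

    async def fake_acquire_batch(n):
        # Stands in for a network call that grants n slots at once.
        calls.append(n)
        await asyncio.sleep(0.01)

    queue = asyncio.Queue()
    worker = asyncio.create_task(batching_acquirer(queue, fake_acquire_batch))
    futs = [asyncio.get_running_loop().create_future() for _ in range(50)]
    for fut in futs:
        queue.put_nowait(fut)  # 50 tasks ask for a slot at the same time
    await asyncio.gather(*futs)
    worker.cancel()
    return len(futs), len(calls)


served, round_trips = asyncio.run(demo())
print(f"{served} slot requests served with {round_trips} server round trip(s)")
```

Near-simultaneous requests collapse into far fewer server calls, which is where the per-task latency would be saved.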
Thanks a lot