Global concurrency makes a script more than twice as slow as a local semaphore
Bug summary
Version 1, with a local semaphore:
import asyncio

from prefect import task, flow

sem = asyncio.Semaphore(100)


@task
async def print_value(value):
    async with sem:
        await asyncio.sleep(10)
        print(value)


@flow
async def async_flow():
    print("Hello, I'm an async flow")
    # runs concurrently
    coros = [print_value(i) for i in range(1000)]
    await asyncio.gather(*coros)


if __name__ == "__main__":
    asyncio.run(async_flow())
This takes 1m 42s
And here is the version using a global concurrency limit named "test" with 100 slots:
import asyncio

from prefect import task, flow
from prefect.concurrency.asyncio import concurrency


@task
async def print_value(value):
    async with concurrency("test"):
        await asyncio.sleep(10)
        print(value)


@flow
async def async_flow():
    print("Hello, I'm an async flow")
    # runs concurrently
    coros = [print_value(i) for i in range(1000)]
    await asyncio.gather(*coros)


if __name__ == "__main__":
    asyncio.run(async_flow())
It takes 4m 6s
This is more than twice as slow.
Version info
Version: 3.1.0
API version: 0.8.4
Python version: 3.12.4
Git commit: a83ba39b
Built: Thu, Oct 31, 2024 12:43 PM
OS/Arch: darwin/arm64
Profile: local
Server type: cloud
Pydantic version: 2.8.2
Integrations:
prefect-gcp: 0.6.1
What region of the world, roughly, are you located in? (Trying to figure out if it's just a network latency thing.)
This is an interesting point. The numbers above were from my local machine, which is in Israel. So I tried running the flow from a K8s cluster in the US, and the time is still not great: 3m 56s. In the K8s cluster it was about 10s faster, but still very slow.
Hi @david-gang, is there a reason you expect a distributed lock that works across machines and environments to be roughly the same speed as a lock that only works within a single thread on a single machine? We can look into motivations for speeding this up for sure, but comparing it to asyncio.Semaphore isn't an apples-to-apples comparison.
Hi @cicdw, I understand that I cannot compare a local lock with a distributed lock. I am looking at it from the perspective of how much time I lose when I use global concurrency, and I need some baseline, even if it is imperfect. What do you think would be a fair baseline for comparison? What is the ideal distributed lock that exists only in theory in CS textbooks, and what is its theoretical overhead?
According to: https://docs.prefect.io/v3/develop/global-concurrency-limits#manage-database-connections
"Manage the maximum number of concurrent database connections to avoid exhausting database resources."
But if every query takes 10 seconds and another 10 seconds get wasted due to the lock, this is something that needs to be mentioned. So first and foremost, the question is how much time I am going to pay if I use this feature. I would expect to pay 10-30% more time, but more than twice the time looks like too much to me.
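To make the comparison concrete, here is a back-of-envelope calculation using the numbers from this thread (1000 tasks, 100 slots, 10 s of work per task); it amortizes the extra time beyond the theoretical floor over all tasks:

```python
tasks = 1000
slots = 100
work_s = 10

# With 100 slots, the 1000 tasks run in 10 sequential waves of 100,
# so no scheduler can finish faster than:
floor_s = (tasks / slots) * work_s  # 100 s, i.e. 1m40s

local_s = 102   # 1m42s measured with asyncio.Semaphore
global_s = 246  # 4m6s measured with the global concurrency limit

# Extra time beyond the theoretical floor, amortized per task:
local_overhead_s = (local_s - floor_s) / tasks    # 0.002 s per task
global_overhead_s = (global_s - floor_s) / tasks  # 0.146 s per task

print(f"floor={floor_s:.0f}s  local=+{local_overhead_s * 1000:.0f}ms/task  "
      f"global=+{global_overhead_s * 1000:.0f}ms/task")
```

So the local semaphore run is essentially at the theoretical floor, while the global limit adds roughly 146 ms per task on top of it.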
Additionally, I think there is room to improve the concurrency implementation. The current code acquires and releases the lock for every item in isolation. Maybe it would be better to check whether another item is waiting and hold the lock longer, with starvation handled in a more sophisticated manner.
Requests to acquire a slot could also be windowed, so that we request x slots together for items submitted within the same second (time window).
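The windowing idea could be sketched roughly as follows. This is a toy illustration, not Prefect's actual client: `batching_acquirer` and `fake_acquire_batch` are hypothetical names, and `acquire_batch` stands in for an assumed server call that grants several slots in one round trip.

```python
import asyncio


async def batching_acquirer(queue, acquire_batch, window_s=0.05):
    # Hypothetical loop: collect slot requests that arrive within the same
    # time window and grant them all with a single server round trip.
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]
        deadline = loop.time() + window_s
        while (timeout := deadline - loop.time()) > 0:
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        await acquire_batch(len(batch))  # one round trip for the whole batch
        for fut in batch:
            fut.set_result(None)  # wake every waiter in the batch


async def demo():
    calls = []

    async def fake_acquire_batch(n):
        # Stands in for a network call that grants n slots at once.
        calls.append(n)
        await asyncio.sleep(0.01)

    queue = asyncio.Queue()
    worker = asyncio.create_task(batching_acquirer(queue, fake_acquire_batch))
    futs = [asyncio.get_running_loop().create_future() for _ in range(50)]
    for fut in futs:
        queue.put_nowait(fut)  # 50 tasks ask for a slot at the same time
    await asyncio.gather(*futs)
    worker.cancel()
    return len(futs), len(calls)


served, round_trips = asyncio.run(demo())
print(f"{served} slot requests served with {round_trips} server round trip(s)")
```

Near-simultaneous requests collapse into far fewer server calls, which is where the per-task latency would be saved.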
Thanks a lot