`exists` method for the async blob client leaks file descriptors
GCBallesteros opened this issue · 0 comments
Which service (blob, file, queue) does this issue concern?
Async blob service. The following code can be used to reproduce the issue:
```python
import asyncio
import os
import time
from contextlib import AsyncExitStack

from azure.storage.blob.aio import BlobServiceClient
from dotenv import load_dotenv

load_dotenv()


async def check_if_blob_exists(container_client, blob):
    # We sleep for a while to slow things down and give us an opportunity
    # to monitor file descriptor use
    time.sleep(0.01)
    blob_client = container_client.get_blob_client(blob=blob)
    if await blob_client.exists():
        return True
    else:
        return False


async def gather_with_concurrency(n, *tasks):
    """A gather function that limits concurrency to avoid overloading the backend.

    Params
    ------
    n: int
        The max number of concurrent coroutines that can be run
    tasks:
        The futures we want to execute
    """
    semaphore = asyncio.Semaphore(n)

    async def sem_task(task):
        async with semaphore:
            return await task

    return await asyncio.gather(
        *(sem_task(task) for task in tasks), return_exceptions=True
    )


async def eat_file_descriptor(container_client):
    blob_name = "some_blob_name"
    _ = await check_if_blob_exists(
        container_client,
        blob=blob_name,
    )


async def main():
    async with AsyncExitStack() as stack:
        blob_service_client = await stack.enter_async_context(
            BlobServiceClient.from_connection_string(
                os.environ["BLOB_STORAGE_CONN_STR"]
            )
        )
        container_client = blob_service_client.get_container_client(
            os.environ["CONTAINER"]
        )
        # Create the futures and gather them
        results = await gather_with_concurrency(
            int(os.environ["MAX_CONCURRENCY_CONSOLIDATE"]),
            *[eat_file_descriptor(container_client) for _ in range(40000)],
        )
        return results


if __name__ == "__main__":
    res = asyncio.run(main())
```
Which version of the SDK was used? Please provide the output of `pip freeze`.
Running Python 3.8.6 under WSL2. My per-process limit on file descriptors is 1024.
```
aiohttp==3.4.4
appdirs==1.4.4
asgiref==3.2.10
async-timeout==3.0.1
attrs==21.2.0
azure-core==1.16.0
azure-identity==1.5.0
azure-kusto-data==2.3.0
azure-storage-blob==12.8.1
black==21.7b0
certifi==2021.5.30
cffi==1.14.6
chardet==3.0.4
charset-normalizer==2.0.3
click==8.0.1
cryptography==3.4.7
idna==3.2
isodate==0.6.0
msal==1.9.0
msal-extensions==0.3.0
msrest==0.6.21
multidict==4.7.6
mypy-extensions==0.4.3
numpy==1.21.1
oauthlib==3.1.1
pandas==1.2.5
pathspec==0.9.0
portalocker==1.7.1
pyarrow==4.0.1
pycparser==2.20
PyJWT==2.1.0
python-dateutil==2.8.2
python-dotenv==0.19.0
pytz==2021.1
regex==2021.7.6
requests==2.26.0
requests-oauthlib==1.3.0
river==0.7.1
scipy==1.7.0
six==1.16.0
structlog==21.1.0
tenacity==8.0.1
tomli==1.1.0
urllib3==1.26.6
yarl==1.6.3
```
What problem was encountered?
The `exists` method of `azure.storage.blob.aio._blob_client_async.BlobClient` leaks file descriptors. If a large number of futures that use the method are launched, the per-process open-file limit kicks in very quickly. From that point on everything grinds to a halt and OS "too many open files" errors start popping up all over the place.
I monitor file descriptor use with the following command, which prints the processes with the highest file descriptor usage:

```shell
for pid in $(ps -o pid -u some_user); do
    echo "$(ls /proc/$pid/fd/ 2>/dev/null | wc -l) for PID: $pid"
done | sort -n | tail
```
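Descriptor usage can also be tracked from inside the Python process itself; a minimal stdlib-only sketch for Linux (this helper is illustrative and not part of the original report) counts the entries under `/proc/self/fd`:

```python
import os


def open_fd_count() -> int:
    """Count this process's currently open file descriptors (Linux only)."""
    return len(os.listdir("/proc/self/fd"))
```

Calling this before and after a batch of `exists()` calls makes the leaked handles visible without shelling out to `ls`/`wc`.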
Have you found a mitigation/solution?
Yes: not using the `exists` method. Instead, I wrap the SDK calls in a try/except block that raises when the blob is not there. Using exceptions for flow control is not ideal, but it saved the day here.