Which service (blob, file, queue) does this issue concern?
Async blob service. The following code can be used to reproduce the issue:
```python
import asyncio
from contextlib import AsyncExitStack
import os
import time

from azure.storage.blob.aio import BlobServiceClient
from dotenv import load_dotenv

load_dotenv()


async def check_if_blob_exists(container_client, blob):
    # We sleep for a while to slow things down and give us an opportunity
    # to monitor file descriptor use
    time.sleep(0.01)
    blob_client = container_client.get_blob_client(blob=blob)
    if await blob_client.exists():
        return True
    else:
        return False


async def gather_with_concurrency(n, *tasks):
    """A gather function that limits the concurrency to avoid overloading
    the backend.

    Params
    ------
    n: int
        The max number of concurrent coroutines that can be run
    tasks:
        The futures we want to execute
    """
    semaphore = asyncio.Semaphore(n)

    async def sem_task(task):
        async with semaphore:
            return await task

    return await asyncio.gather(
        *(sem_task(task) for task in tasks), return_exceptions=True
    )


async def eat_file_descriptor(container_client):
    blob_name = "some_blob_name"
    _ = await check_if_blob_exists(
        container_client,
        blob=blob_name,
    )


async def main():
    async with AsyncExitStack() as stack:
        blob_service_client = await stack.enter_async_context(
            BlobServiceClient.from_connection_string(
                os.environ["BLOB_STORAGE_CONN_STR"]
            )
        )
        container_client = blob_service_client.get_container_client(
            os.environ["CONTAINER"]
        )
        # Create the futures and gather them
        results = await gather_with_concurrency(
            int(os.environ["MAX_CONCURRENCY_CONSOLIDATE"]),
            *[eat_file_descriptor(container_client) for _ in range(40000)],
        )
        return results


if __name__ == "__main__":
    res = asyncio.run(main())
```
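As a side note, the `gather_with_concurrency` helper above can be exercised on its own, without Azure. The following self-contained sketch (the `job` coroutine and the concurrency probe are illustrative, not part of the original report) checks that the semaphore really caps how many coroutines run at once:

```python
import asyncio


async def gather_with_concurrency(n, *tasks):
    # Same pattern as in the repro: a semaphore caps how many of the
    # wrapped awaitables run at once.
    semaphore = asyncio.Semaphore(n)

    async def sem_task(task):
        async with semaphore:
            return await task

    return await asyncio.gather(
        *(sem_task(task) for task in tasks), return_exceptions=True
    )


async def demo():
    running = 0
    peak = 0

    async def job(i):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)
        running -= 1
        return i

    results = await gather_with_concurrency(5, *(job(i) for i in range(40)))
    return results, peak


results, peak = asyncio.run(demo())
print(results == list(range(40)), peak <= 5)  # True True
```

`asyncio.gather` preserves the order of its inputs, so the results come back in submission order even though execution is throttled.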
Which version of the SDK was used? Please provide the output of `pip freeze`.
Running Python 3.8.6 under WSL2. My per-process limit on file descriptors is 1024.
What problem was encountered?
The `exists` method of `azure.storage.blob.aio._blob_client_async.BlobClient` leaks file descriptors. If a large number of futures that use the method are launched, the per-process limit on open files kicks in very quickly. From that point on everything grinds to a halt and the OS starts raising "too many open files" errors all over the place.
I monitor file descriptor use with the following command, which prints the processes with the highest file descriptor usage:

```shell
for pid in $(ps -o pid -u some_user); do echo "$(ls /proc/$pid/fd/ 2>/dev/null | wc -l) for PID: $pid"; done | sort -n | tail
```
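The same check can also be done from inside the process itself. Here is a minimal Linux-only sketch (not part of the original report) that counts the entries under `/proc/self/fd`, which is what the shell one-liner above does per PID:

```python
import os


def open_fd_count():
    # Each entry under /proc/self/fd is one file descriptor currently
    # open in this process (Linux-only).
    return len(os.listdir("/proc/self/fd"))


before = open_fd_count()
f = open("/dev/null")   # opening any file consumes exactly one descriptor
after = open_fd_count()
f.close()
print(after - before)  # 1
```

Sampling this counter around the suspect SDK call makes a leak visible without leaving the Python process.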
Have you found a mitigation/solution?
Yes: not using the `exists` method. I surround the SDK calls in a try/except block that raises when the blob is not there. Using exceptions for flow control is not ideal, but it saved the day here.
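The workaround can be sketched roughly like this. With the real SDK you would catch `azure.core.exceptions.ResourceNotFoundError` around an async `get_blob_properties()` call; the sketch below substitutes a stand-in client and exception class so it runs without the SDK installed, and `blob_exists_via_properties` is a hypothetical helper name:

```python
import asyncio


class ResourceNotFoundError(Exception):
    """Stand-in for azure.core.exceptions.ResourceNotFoundError so the
    sketch runs without the SDK installed."""


class FakeBlobClient:
    """Hypothetical stand-in for an async BlobClient."""

    def __init__(self, present):
        self._present = present

    async def get_blob_properties(self):
        # The real async client raises ResourceNotFoundError when the
        # blob does not exist.
        if not self._present:
            raise ResourceNotFoundError("blob not found")
        return {"name": "some_blob_name"}


async def blob_exists_via_properties(blob_client):
    # Probe for the blob instead of calling `exists`; translate the
    # "not found" exception into False.
    try:
        await blob_client.get_blob_properties()
        return True
    except ResourceNotFoundError:
        return False


print(asyncio.run(blob_exists_via_properties(FakeBlobClient(True))))   # True
print(asyncio.run(blob_exists_via_properties(FakeBlobClient(False))))  # False
```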
Output of `pip freeze`:
```
aiohttp==3.4.4
appdirs==1.4.4
asgiref==3.2.10
async-timeout==3.0.1
attrs==21.2.0
azure-core==1.16.0
azure-identity==1.5.0
azure-kusto-data==2.3.0
azure-storage-blob==12.8.1
black==21.7b0
certifi==2021.5.30
cffi==1.14.6
chardet==3.0.4
charset-normalizer==2.0.3
click==8.0.1
cryptography==3.4.7
idna==3.2
isodate==0.6.0
msal==1.9.0
msal-extensions==0.3.0
msrest==0.6.21
multidict==4.7.6
mypy-extensions==0.4.3
numpy==1.21.1
oauthlib==3.1.1
pandas==1.2.5
pathspec==0.9.0
portalocker==1.7.1
pyarrow==4.0.1
pycparser==2.20
PyJWT==2.1.0
python-dateutil==2.8.2
python-dotenv==0.19.0
pytz==2021.1
regex==2021.7.6
requests==2.26.0
requests-oauthlib==1.3.0
river==0.7.1
scipy==1.7.0
six==1.16.0
structlog==21.1.0
tenacity==8.0.1
tomli==1.1.0
urllib3==1.26.6
yarl==1.6.3
```