Azure/azure-sdk-for-python

Data Lake: Iterating over result of `get_paths` method on `FileSystemClient` raises HTTP error

ShivnarenSrinivasan opened this issue · 3 comments

Describe the bug
Calling the get_paths method and iterating over the result raises an HttpResponseError.

HttpResponseError: (InvalidQueryParameterValue) Value for one of the query parameters specified in the request URI is invalid.

To Reproduce
Steps to reproduce the behavior:

import os
from azure.storage.filedatalake import DataLakeServiceClient
from azure.identity import ClientSecretCredential

ACCOUNT_NAME = os.getenv('ACCOUNT_NAME')

credential = ClientSecretCredential(os.getenv('TENANT_ID'), os.getenv('CLIENT_ID'), os.getenv('CLIENT_SECRET'))
service = DataLakeServiceClient(account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net", credential=credential)
# this connection works for creating, deleting, and modifying files and directories

filesystem = service.get_file_system_client('data')
for path in filesystem.get_paths('tmp'):
    print(path.name)
    # raises http exception instead

Expected behavior
The iterator object returned by get_paths should yield valid filesystem files/directories.

Screenshots
Error raised upon iteration:
image

Additional context
I do not believe this is a direct bug in the SDK, as I was able to reproduce the issue by calling the underlying REST API directly. However, I am hoping for some insight into the overall process.
In any case, if the method does not work as documented, perhaps some changes are necessary.

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jalauzon-msft @vincenttran-msft.

Hi @ShivnarenSrinivasan, thanks for the inquiry! After taking a closer look at your RequestId, we were able to reproduce the error you are facing. The sample snippet provided above does not actually trigger this issue, so here is a separate code example that should help explain the root cause.

image
Here is an example screenshot from my Azure Portal.

  • filesystemlevel is created using the "+ Container" button, so it is fundamentally different from every other hierarchical structure (i.e. folders, files, etc.). This is your actual file system.
    image
  • firstdirectorylevel and all structures nested beneath it are created using the "Upload" or "Add Directory" buttons. These are therefore not file systems; they are files or directories.
    image

With that being said, your RequestId reveals that you are passing a directory path to the get_file_system_client API, which expects only the name of a file system.

For example, your code that reproduces the failure in this example would look like: service.get_file_system_client('filesystemlevel/firstdirectorylevel')

Whereas the correct code snippet would look like:
service.get_file_system_client('filesystemlevel')

Then, if your goal is to drill down to the paths in tmp, you would pass:
filesystem.get_paths('firstdirectorylevel/seconddirectorylevel/tmp')

In short, the root cause of the issue is that you were specifying more than just the file system when getting a file system client. Hopefully this example makes sense and should unblock your workflow, otherwise please do not hesitate to reach out again!
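For anyone who starts from a single combined path string, the split described above can be sketched as follows. This is a minimal illustration, not part of the SDK: the helper name split_file_system_and_path is hypothetical, and it assumes the first path segment is always the file system (container) name.

```python
from typing import Optional, Tuple


def split_file_system_and_path(full_path: str) -> Tuple[str, Optional[str]]:
    """Split 'filesystem/dir/subdir' into ('filesystem', 'dir/subdir').

    Hypothetical helper: assumes the first segment is the file system
    (container) name and everything after it is a directory path.
    """
    # Drop leading/trailing slashes, then split off the first segment only.
    parts = full_path.strip("/").split("/", 1)
    file_system = parts[0]
    if not file_system:
        raise ValueError("Path does not contain a file system name.")
    # No remaining segments means "list from the file system root".
    directory = parts[1] if len(parts) > 1 and parts[1] else None
    return file_system, directory


# Usage with the SDK (requires a real account, so left as comments):
# fs_name, directory = split_file_system_and_path(
#     "filesystemlevel/firstdirectorylevel/seconddirectorylevel/tmp")
# filesystem = service.get_file_system_client(fs_name)
# for path in filesystem.get_paths(path=directory):
#     print(path.name)
```

The file system name goes to get_file_system_client, and the remainder goes to get_paths.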

Thanks!

Thanks a lot, @vincenttran-msft -- in an attempt to simplify the code I was working with, I seem to have left out the most critical detail. My apologies.
One thing to add: I am part of an organization where I do not have admin privileges, and the directory I was trying to access was merely provisioned for me; hence I was unaware of the container/directory distinction.

This explanation is very helpful, and after making the changes, I'm good to go.
The issue I raised is certainly closed, but this does feel like a "gotcha" for an uninitiated user (especially since the HttpResponseError is so generic).

Docs

I am not well versed in the terminology yet, but I couldn't find any specification of what a filesystem represents, or of its restrictions (i.e. that it should be a container), in the docs (which I believe is the README in the relevant git directory). Would it be worth adding some? Happy to submit a PR; it could go in the README or in the docstring of the get_file_system_client method itself.

Runtime Check

Further, I don't know if this is a correct assumption, but if containers cannot be nested, then the only valid argument to the get_file_system_client call would be a root-level name.
Would it be appropriate to add a runtime check to ensure that only a single (i.e. top-level) path segment is passed, rather than what I did?
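To make the proposal concrete, here is a rough sketch of what such a check could look like on the caller's side. The function name validate_file_system_name is hypothetical and this is not how the SDK behaves today; it only rejects names containing a path separator and does not enforce the full container naming rules.

```python
def validate_file_system_name(name: str) -> str:
    """Return the name unchanged if it looks like a single top-level segment.

    Hypothetical client-side guard: raises ValueError early, with a
    specific message, instead of letting the service reject the request.
    """
    if not name:
        raise ValueError("File system name must not be empty.")
    if "/" in name or "\\" in name:
        raise ValueError(
            f"{name!r} looks like a path. Pass only the file system "
            "(container) name here and give the directory to get_paths."
        )
    return name


# Usage with the SDK (requires a real account, so left as a comment):
# filesystem = service.get_file_system_client(validate_file_system_name("data"))
```

With a guard like this, passing 'filesystemlevel/firstdirectorylevel' would fail immediately with a message that points at the container/directory distinction.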