Data Lake: Iterating over result of `get_paths` method on `FileSystemClient` raises HTTP error
ShivnarenSrinivasan opened this issue · 3 comments
- Package Name: azure.storage.filedatalake
- Package Version: 12.15.0
- Operating System: Windows 11
- Python Version: 3.12.3
Describe the bug
Calling the `get_paths` method and iterating over the result raises an `HttpResponseError`:
HttpResponseError: (InvalidQueryParameterValue) Value for one of the query parameters specified in the request URI is invalid.
To Reproduce
Steps to reproduce the behavior:
import os
from azure.storage.filedatalake import DataLakeServiceClient
from azure.identity import ClientSecretCredential
ACCOUNT_NAME = os.getenv('ACCOUNT_NAME')
credential = ClientSecretCredential(os.getenv('TENANT_ID'), os.getenv('CLIENT_ID'), os.getenv('CLIENT_SECRET'))
service = DataLakeServiceClient(account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net", credential=credential)
# this connection works for creating, deleting, and modifying files and directories
filesystem = service.get_file_system_client('data')
for path in filesystem.get_paths('tmp'):
    print(path.name)
# raises an HTTP exception instead
Expected behavior
The iterator object returned by `get_paths` should yield valid filesystem files/directories.
Screenshots
Error raised upon iteration:
Additional context
I do not believe this is a direct bug in the SDK, as I was able to replicate this issue while calling the underlying REST API directly--however, hoping there is some insight on the overall process.
In any case, if the method does not work as documented, perhaps some changes are necessary.
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jalauzon-msft @vincenttran-msft.
Hi @ShivnarenSrinivasan, thanks for the inquiry! After taking a closer look at your RequestId, we were able to reproduce the error you are facing. The sample snippet provided above does not actually hit this issue, so here is a separate code example that should help explain the root cause.
Here is an example screenshot from my Azure Portal.
- `filesystemlevel` is created using the "+ Container" button, so it is fundamentally different from any other hierarchical structure (folders, files, etc.). This is your actual file system.
- `firstdirectorylevel` and all structures nested beneath it are created using the "Upload" or "Add Directory" button. These are therefore not file systems but files or directories.
With that being said, your RequestId reveals that you are passing a directory path to the `get_file_system_client` API.
For example, the code that reproduces the failure in this example would look like:
service.get_file_system_client('filesystemlevel/firstdirectorylevel')
Whereas the correct code snippet would look like:
service.get_file_system_client('filesystemlevel')
Then, if your goal is to drill down to the paths in `tmp`, you would pass:
filesystem.get_paths('firstdirectorylevel/seconddirectorylevel/tmp')
In short, the root cause of the issue is that you were specifying more than just the file system when getting a file system client. Hopefully this example makes sense and should unblock your workflow, otherwise please do not hesitate to reach out again!
Thanks!
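To illustrate the split described above, here is a minimal sketch that separates a combined path into the file system (container) name and the directory prefix. `split_datalake_path` is a hypothetical helper written for this example, not part of the SDK, and the path used is the one from the example above.

```python
def split_datalake_path(full_path: str) -> tuple[str, str]:
    """Split 'container/dir/sub' into ('container', 'dir/sub').

    Hypothetical helper for illustration; not part of azure-storage-file-datalake.
    """
    # The first path segment is the file system; everything after it is a directory path.
    file_system, _, directory = full_path.strip("/").partition("/")
    if not file_system:
        raise ValueError("path must start with a file system (container) name")
    return file_system, directory


file_system_name, directory = split_datalake_path(
    "filesystemlevel/firstdirectorylevel/seconddirectorylevel/tmp"
)
print(file_system_name)  # filesystemlevel
print(directory)         # firstdirectorylevel/seconddirectorylevel/tmp

# With the `service` client from the original report, the two parts would be used as:
#   filesystem = service.get_file_system_client(file_system_name)
#   for path in filesystem.get_paths(directory):
#       print(path.name)
```

Only the first segment goes to `get_file_system_client`; the remainder goes to `get_paths`.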
Thanks a lot, @vincenttran-msft -- in an attempt to simplify the code I was working with, I seem to have left out the most critical detail. My apologies.
One thing to add: I am part of an organization where I do not have admin privileges, and the directory I was trying to access was merely provisioned for me; hence I was unaware of the container/directory distinction.
This explanation is very helpful, and after making the changes, I'm good to go.
The issue I raised is certainly closed, but this does feel like a "gotcha" to an uninitiated user (esp. since the HTTPException is so generic).
Docs
I am not well versed in the terminology yet, but I couldn't find any specification of what a file system represents, or of its restrictions (i.e. that it should be a container), in the docs (which I believe is the README in the relevant git directory). Would it be worth adding some? Happy to submit a PR; it could go in the README or in the docstring of the `get_file_system_client` method itself.
Runtime Check
Further, I don't know if this is a correct assumption, but if containers cannot be nested, then the only valid argument to the `get_file_system_client` call would be a root-level name.
Would it be appropriate to add a runtime check to ensure that only a single (i.e. top-level) path segment is passed, rather than what I did?
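As a rough sketch of what such a check could look like on the caller's side, the function below rejects any name containing a path separator before it reaches the service. `check_file_system_name` is a hypothetical name chosen for this example; the SDK is not known to perform this check, and this sketch does not cover the full container naming rules.

```python
def check_file_system_name(name: str) -> str:
    """Return `name` unchanged if it looks like a single top-level segment.

    Hypothetical client-side guard for illustration; not an SDK function.
    """
    if not name:
        raise ValueError("file system name must not be empty")
    if "/" in name:
        # Anything after the first segment is a directory, not part of the file system name.
        first, _, rest = name.partition("/")
        raise ValueError(
            f"{name!r} is not a file system name; use {first!r} as the file system "
            f"and pass {rest!r} to get_paths instead"
        )
    return name


print(check_file_system_name("filesystemlevel"))  # filesystemlevel

try:
    check_file_system_name("filesystemlevel/firstdirectorylevel")
except ValueError as exc:
    print(f"rejected: {exc}")

# Intended use with the `service` client from the original report:
#   filesystem = service.get_file_system_client(check_file_system_name(name))
```

Failing early with a specific message avoids the generic `InvalidQueryParameterValue` response described above.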