fsspec/adlfs

Support for storage accounts in the URL (Azure Data Lake Storage Gen2 URI support)

aucampia opened this issue · 3 comments

Azure Data Lake Storage Gen2 URIs are described as follows [ref]:

abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>

Scheme identifier: The abfs protocol is used as the scheme identifier. If you add an 's' at the end (abfss) then the ABFS Hadoop client driver will ALWAYS use Transport Layer Security (TLS) irrespective of the authentication method chosen. If you choose OAuth as your authentication then the client driver will always use TLS even if you specify 'abfs' instead of 'abfss' because OAuth solely relies on the TLS layer. Finally, if you choose to use the older method of storage account key, then the client driver will interpret 'abfs' to mean that you do not want to use TLS.

File system: The parent location that holds the files and folders. This is the same as Containers in the Azure Storage Blobs service.

Account name: The name given to your storage account during creation.

Paths: A forward slash delimited (/) representation of the directory structure.

File name: The name of the individual file. This parameter is optional if you are addressing a directory.
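
To make the pieces concrete, here is roughly how such a URI decomposes with plain urllib (the account and file system names below are made up):

from urllib.parse import urlparse

u = urlparse("abfss://myfilesystem@myaccount.dfs.core.windows.net/raw/2023/data.csv")
file_system, _, host = u.netloc.partition("@")  # "myfilesystem", "myaccount.dfs.core.windows.net"
account_name = host.split(".", 1)[0]            # "myaccount"
path = u.path                                   # "/raw/2023/data.csv"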

This URI scheme carries the storage account name, which makes it much more versatile than having to provide the account out of band.

Would you be open to adding support for this?

It would be great to have support for abfss too!

What's the actual work to be done here? IIUC, do we just need to parse that URI and extract the account URL and container name used by the existing implementation? Can we use regular azure-storage-blob to work with these kinds of containers, or do we need to use another API to talk to Azure?
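
If it really is just parsing, a rough sketch of what I have in mind (untested; the helper name is made up, and I'm assuming the account's matching .blob.core.windows.net endpoint is what azure-storage-blob should talk to):

from urllib.parse import urlparse

from azure.storage.blob import BlobServiceClient


def container_client_from_abfs_url(url, credential=None):
    # hypothetical helper, not part of adlfs today
    parsed = urlparse(url)  # abfss://<file_system>@<account>.dfs.core.windows.net/<path>
    container, _, host = parsed.netloc.partition("@")
    account = host.split(".", 1)[0]
    # ADLS Gen2 accounts also expose a blob endpoint, which azure-storage-blob speaks
    service = BlobServiceClient(
        account_url=f"https://{account}.blob.core.windows.net", credential=credential
    )
    return service.get_container_client(container), parsed.path.lstrip("/")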

If we can use azure.storage.blob, then I think we would need to

  • Decide on whether to overload the existing AzureBlobFileSystem.__init__ to also accept being called with this URI type. That's probably the most convenient for users but might be a bit tricky to implement (it kinda clashes with the current implementation). Maybe it'd be best to have a separate FileSystem class that handles this URI, which internally uses AzureBlobFileSystem?
  • Register abfs[s] as prefixes with fsspec (rough sketch after this list)
  • tests
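
For the registration point, I think something along these lines would do (untested; it just reuses the existing class, and the account name is made up):

import fsspec
from adlfs import AzureBlobFileSystem

# route both schemes to the same implementation; clobber in case
# "abfs" is already registered via fsspec's known_implementations
fsspec.register_implementation("abfs", AzureBlobFileSystem, clobber=True)
fsspec.register_implementation("abfss", AzureBlobFileSystem, clobber=True)

fs = fsspec.filesystem("abfss", account_name="myaccount")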

Faced the same issue the other day... so I amended the AzureBlobFileSystem._strip_protocol method to handle the Azure Blob Storage host name. Here's a suggestion:

# needs `from fsspec.utils import infer_storage_options` and a module-level `logger`
@classmethod
def _strip_protocol(cls, path: str):
    """
    Remove the protocol from the input path

    Parameters
    ----------
    path: str
        Path to remove the protocol from

    Returns
    -------
    str
        Returns a path without the protocol
    """
    if isinstance(path, list):
        return [cls._strip_protocol(p) for p in path]

    STORE_SUFFIXES = [".blob.core.windows.net", ".dfs.core.windows.net"]
    logger.debug(f"_strip_protocol for {path}")
    if not path.startswith(("abfs://", "az://", "abfss://")):
        path = path.lstrip("/")
        path = "abfs://" + path
    ops = infer_storage_options(path)
    if "username" in ops:
        if ops.get("username", None):
            ops["path"] = ops["username"] + ops["path"]
    # we need to make sure that the path retains
    # the format {host}/{path}
    # here host is the container_name
    elif ops.get("host", None):
        # no store suffix, so the host is the container name
        if not any(ops["host"].endswith(s) for s in STORE_SUFFIXES):
            ops["path"] = ops["host"] + ops["path"]
    url_query = ops.get("url_query")
    if url_query is not None:
        ops["path"] = f"{ops['path']}?{url_query}"

    logger.debug(f"_strip_protocol({path}) = {ops}")
    stripped_path = ops["path"].lstrip("/")
    return stripped_path
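
With that change I'd expect calls like these to come out as the container-prefixed paths the rest of adlfs already works with (names made up):

AzureBlobFileSystem._strip_protocol(
    "abfss://myfilesystem@myaccount.dfs.core.windows.net/raw/data.csv"
)
# -> "myfilesystem/raw/data.csv"

AzureBlobFileSystem._strip_protocol("abfs://myfilesystem/raw/data.csv")
# -> "myfilesystem/raw/data.csv"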