fsspec/adlfs

Support virtual directory stubs with uppercase "Hdi_isfolder" metadata

IamJeffG opened this issue · 1 comment

Background

adlfs checks blobs for, among other things, the `hdi_isfolder` metadata key to determine whether a blob is a stub for a virtual directory, i.e. a marker indicating that other files are "nested under it."

adlfs/adlfs/spec.py

Lines 857 to 859 in 092685f

```python
elif data["metadata"].get("hdi_isfolder") == "true":
    data["type"] = "directory"
    data["size"] = None
```
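To see why this exact-match lookup misses the capitalized key, here is a minimal pure-Python illustration (the `current_check` helper is hypothetical, mirroring the lookup above rather than quoting adlfs's code):

```python
def current_check(metadata: dict) -> bool:
    # Mirrors the exact-match lookup above: only a lowercase key matches.
    return metadata.get("hdi_isfolder") == "true"

print(current_check({"hdi_isfolder": "true"}))  # True: lowercase stub is detected
print(current_check({"Hdi_isfolder": "true"}))  # False: capitalized stub is missed
```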

Problem

I have encountered some bugs in upstream blob writers where they create this stub blob with a capital-H Hdi_isfolder metadata key. Examples include:

  • Azure/azure-sdk-for-go#17850
  • I also see it when writing to Azure Blob Storage locations mounted as virtual volumes inside containerized Azure Batch tasks (i.e. inside a Docker container) on a Batch Pool. The capitalized metadata keys began appearing between August 28 and August 31, 2023, and I have an internal Azure support ticket open for this.
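For testing, a stub with the problematic casing can be created by hand. This is a hedged sketch (it assumes the azure-storage-blob package and an authenticated `ContainerClient`; `make_stub`, `STUB_METADATA`, and the path are illustrative, not taken from any of the buggy writers):

```python
# The problematic key casing (capital H) emitted by the buggy upstream writers.
STUB_METADATA = {"Hdi_isfolder": "true"}

def make_stub(container_client, path):
    """Upload a zero-byte blob marked as a virtual-directory stub."""
    # ContainerClient.upload_blob accepts a metadata mapping as a keyword argument.
    container_client.upload_blob(name=path, data=b"", metadata=STUB_METADATA)
```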

Anyway, when the metadata key is `Hdi_isfolder` (capital H), adlfs does not detect that the blob is a directory stub; it instead treats the blob as a file containing data, and then errors:

```python
import adlfs
import pyarrow
from pyarrow.dataset import dataset

abfs = adlfs.AzureBlobFileSystem(account_name="example", account_key="XXXX")
pqdata = dataset("path/to/stub", filesystem=abfs)
```

```
ArrowInvalid: Error creating dataset. Could not read schema from 'path/to/stub'.
  Is this a 'parquet' file?:
    Could not open Parquet input source 'path/to/stub':
      Parquet file size is 0 bytes

  File "pip/pyarrow==13.0.0/pyarrow/dataset.py", line 773, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "pip/pyarrow==13.0.0/pyarrow/dataset.py", line 466, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^

  File "pyarrow/_dataset.pyx", line 2941, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
```

Request

I am of the opinion that this is a bug in the writers of these stub blobs and should be fixed there. However, the damage is already done: stub blobs with the capitalized metadata key already exist in the wild. So I am wondering whether adlfs can become case-insensitive in its metadata check.
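A case-insensitive check could look like this minimal sketch (the `is_folder_stub` helper is hypothetical, not adlfs's actual code; the real fix would live in the `spec.py` branch quoted above):

```python
def is_folder_stub(metadata) -> bool:
    """True if any casing of the hdi_isfolder key marks a directory stub."""
    if not metadata:
        return False
    # Compare keys case-insensitively so both "hdi_isfolder" and "Hdi_isfolder" match.
    return any(
        key.lower() == "hdi_isfolder" and value == "true"
        for key, value in metadata.items()
    )

print(is_folder_stub({"hdi_isfolder": "true"}))  # True
print(is_folder_stub({"Hdi_isfolder": "true"}))  # True
print(is_folder_stub({"other": "x"}))            # False
```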

Beware

This bug is especially difficult to diagnose. One reason is that Azure Storage Explorer displays all metadata names in lowercase, even when they are not actually lowercase. I've found that I have to use the 'Containers' app in the Azure Portal to view the correct metadata names on stub blobs.

Thanks for the writeup. This is great context for the PR fixing this at #418.