Support virtual directory stubs with uppercase "Hdi_isfolder" metadata
IamJeffG opened this issue · 1 comments
Background
adlfs checks blobs for, among other things, the hdi_isfolder
metadata key, to determine if the blob is a stub for a virtual directory, i.e. there are other files "nested under it."
Lines 857 to 859 in 092685f
Problem
I have encountered some bugs in upstream blob writers where they create this stub blob with a capital-H Hdi_isfolder
metadata key. Examples include:
- Azure/azure-sdk-for-go#17850
- I also see it happen when writing to Azure Blob Storage locations that are mounted as virtual volumes from inside containerized Azure Batch tasks (i.e. inside a Docker container), on a Batch Pool. I saw the capitalized metadata keys begin appearing between August 28 and August 31, 2023, and I have an internal Azure support ticket open for this.
Anyway, when the metadata key is Hdi_isfolder
(capital-H), adfs does not detect that the blob is a directory stub, but rather thinks it's a file containing data. Then it errors:
import adlfs
import pyarrow
from pyarrow.dataset import dataset
abfs = adlfs.AzureBlobFileSystem(account_name="example", account_key="XXXX")
pqdata = dataset("path/to/stub", filesystem=abfs)
ArrowInvalid: Error creating dataset. Could not read schema from 'path/to/stub'.
Is this a 'parquet' file?:
Could not open Parquet input source 'path/to/stub':
Parquet file size is 0 bytes
File "pip/pyarrow==13.0.0/pyarrow/dataset.py", line 773, in dataset
return _filesystem_dataset(source, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pip/pyarrow==13.0.0/pyarrow/dataset.py", line 466, in _filesystem_dataset
return factory.finish(schema)
^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_dataset.pyx", line 2941, in pyarrow._dataset.DatasetFactory.finish
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
Request
I am of the opinion that this is a bug in the writers of these stub blobs, and should be fixed there. However the damage has already been done and these stub blobs with capitalized metadata key already exist in the wild. So I am wondering if we can have adlfs become case-insensitive in its metadata check.
Beware
This bug is especially difficult to diagnose. One reason is that Azure Storage Explorer automatically shows all metadata names in lowercase, even if they are not actually in lowercase. I've found that I have to use the 'Containers' app in Azure Portal to view correct metadata names on stub blobs.
Thanks for the writeup. This is great context for the PR fixing this at #418.