GCSFS reports directory as FileNotFoundError when it exists. Run 1 fails, run 2 succeeds. Caching?
pascalwhoop opened this issue · 5 comments
Hi,
We went down a rabbit hole trying to find this one.
apache/arrow#31339
It turns out Pandas can't read partitioned Parquet files from a directory when PyArrow uses GCSFS underneath.
However, there seems to be no mention of this in this repo. Are you aware of any situation where the library is non-deterministic or has caching issues when listing a directory?
import gcsfs

PATH = "bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes"

fs = gcsfs.GCSFileSystem()
# call info() three times on the same, unchanged path
print(fs.info(PATH))
print(fs.info(PATH))
print(fs.info(PATH))
Returns:
{'kind': 'storage#object', 'id': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes//1721313663057121', 'selfLink': 'https://www.googleapis.com/storage/v1/b/bucket-dev-storage/o/kedro%2Fstaging%2Fdata%2F05_model_input%2Fdrugs_diseases_nodes%2F', 'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/bucket-dev-storage/o/kedro%2Fstaging%2Fdata%2F05_model_input%2Fdrugs_diseases_nodes%2F?generation=1721313663057121&alt=media', 'name': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes/', 'bucket': 'bucket-dev-storage', 'generation': '1721313663057121', 'metageneration': '1', 'contentType': 'application/octet-stream', 'storageClass': 'STANDARD', 'size': 0, 'md5Hash': '1B2M2Y8AsgTpgAmY7PhCfg==', 'crc32c': 'AAAAAA==', 'etag': 'COH5u4vpsIcDEAE=', 'timeCreated': '2024-07-18T14:41:03.059Z', 'updated': '2024-07-18T14:41:03.059Z', 'timeStorageClassUpdated': '2024-07-18T14:41:03.059Z', 'type': 'file'}
{'bucket': 'bucket-dev-storage', 'name': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes', 'size': 0, 'storageClass': 'DIRECTORY', 'type': 'directory'}
{'bucket': 'bucket-dev-storage', 'name': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes', 'size': 0, 'storageClass': 'DIRECTORY', 'type': 'directory'}
Note that the first call returns a different result from calls 2 and 3. What's up with that?
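One way to test whether this is the listings cache (a sketch; fs.invalidate_cache() is the standard fsspec call for dropping cached directory listings, and the path is the same as above):

import gcsfs

PATH = "bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes"

fs = gcsfs.GCSFileSystem()
print(fs.info(PATH))   # first call populates the internal listings cache
fs.invalidate_cache()  # drop cached listings before asking again
print(fs.info(PATH))   # if the results still differ, caching alone isn't the cause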
You seem to have a key and a directory with the same name, which is unfortunate. While it is unclear which of these gcsfs should return with info(), I agree that it should be consistent.
For the original issue over in arrow, I can point out that the following works fine:
import pandas as pd  # with fs = gcsfs.GCSFileSystem() as above
pd.read_parquet("bucket/partitioned.parq", filesystem=fs)
i.e., passing the filesystem object explicitly rather than providing a protocol prefix in the path. (Also, fastparquet has no problem with any of the possible forms!)
This is because every filesystem has its own internal convention for naming paths, and apparently Arrow is not using something like fsspec.url_to_fs to find what the root path should be, or otherwise processing the path. Consider, for example, that "gcs" is the conventional prefix for gcsfs, but other frameworks (particularly the gcloud CLI) use "gs", so automatically re-adding the prefix isn't straightforward.
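For illustration, a sketch of what that resolution looks like (fsspec.core.url_to_fs returns the filesystem instance plus the protocol-stripped root path; the bucket name is made up):

import fsspec

# both "gs://" and "gcs://" are registered aliases for gcsfs
fs, root = fsspec.core.url_to_fs("gs://bucket/partitioned.parq")
print(type(fs).__name__, root)  # -> GCSFileSystem bucket/partitioned.parq

fs2, root2 = fsspec.core.url_to_fs("gcs://bucket/partitioned.parq")
print(root == root2)            # -> True: both normalize to the same root path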
(Note that no one asked for my opinion in the upstream arrow thread.)
Also: I am not able to reproduce your behaviour with or without a placeholder directory. Can you try to make a full reproducer, please?
That's curious. When you say "key and directory with the same name", does that mean we wrote that dir as a key first?
For context, we're using PySpark to write a dataset to this path. I can imagine it creates a placeholder there first, although running Spark against object storage is a pretty common scenario.
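One way to check for such a placeholder is to list the raw object names under the prefix: a zero-byte object whose name ends in "/" is the directory marker that Hadoop/Spark connectors often write. A sketch using the google-cloud-storage client directly, with the bucket and prefix taken from the repro above (assumes default credentials):

from google.cloud import storage

client = storage.Client()
prefix = "kedro/staging/data/05_model_input/drugs_diseases_nodes"
for blob in client.list_blobs("bucket-dev-storage", prefix=prefix):
    # a size-0 blob named ".../drugs_diseases_nodes/" is a placeholder marker
    print(repr(blob.name), blob.size)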
when you say "key and directory with the same name" does that mean we wrote that dir as a key first?
I can't say how it came to be, only that I suppose you have both a key called "bucket/path" and objects with names like "bucket/path/...", which also imply the directory.
Now, as I say, I also tried making such a key, but did not see any problem calling info() on it afterward.
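For concreteness, a sketch along those lines (fs.touch() and fs.pipe() are standard fsspec calls; the bucket name is made up):

import gcsfs

fs = gcsfs.GCSFileSystem()
fs.touch("my-bucket/path")                      # zero-byte key at the "directory" name
fs.pipe("my-bucket/path/part-0.parquet", b"x")  # real object under the same prefix

fs.invalidate_cache()
print(fs.info("my-bucket/path"))  # consistent for me across repeated calls
print(fs.info("my-bucket/path"))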