[Python] pyarrow.parquet.read_table either returns FileNotFound or ArrowInvalid
Running the code below results in: "GetFileInfo() yielded path 'myBucket/features/MyParquet.parquet/year=2022/part-0019.snappy.parquet' which is outside base dir 'gs://myBucket/features/MyParquet.parquet/'"
import pyarrow.parquet as pq
import gcsfs
file_path="gs://myBucket/features/MyParquet.parquet/"
fs=gcsfs.GCSFileSystem()
table=pq.read_table(file_path,filesystem=fs)
Removing the gs:// from file_path results in a FileNotFoundError. Any variation of / or // at the beginning of the path gives me the 'outside base dir' error.
I also ran the code below and got valid results with both file_path patterns, so I know it finds the path just fine.
from pyarrow.fs import FileSelector, PyFileSystem, FSSpecHandler
filesys = PyFileSystem(FSSpecHandler(fs))
selector = FileSelector(file_path, recursive=True)
filesys.get_file_info(selector)
Environment: GCP JupyterLab notebooks
Reporter: Callista Rogers
Note: This issue was originally created as ARROW-15910. Please see the migration documentation for further details.
Antoine Pitrou / @pitrou:
Can you try with PyArrow 7.0.0?
Also cc @jorisvandenbossche
Callista Rogers:
Hey Antoine,
Same results in both tests with PyArrow 7.0.0 as well
Joris Van den Bossche / @jorisvandenbossche:
The first error is expected I think, because you can either pass a URI (with "gs://") or a path + filesystem object (in which case the file path cannot contain "gs://"). And since we do not yet expose GCS filesystem discovery in Python, the only option is to do it via the fsspec filesystem, and thus you have to pass a file path and not a URI.
Removing the gs:// from file_path results in a FileNotFoundError. Any variation of / or // at the beginning of the path gives me the 'outside base dir' error.
So this would then be expected to work (remove the gs://).
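(For illustration, a minimal sketch of the two call patterns being described; the bucket and path names are the placeholders from this thread, and the URI form is shown only for contrast:)
import pyarrow.parquet as pq
import gcsfs

fs = gcsfs.GCSFileSystem()

# Path + explicit filesystem object: the path must not carry a scheme.
table = pq.read_table("myBucket/features/MyParquet.parquet/", filesystem=fs)

# A URI without a filesystem object would look like the line below, but
# "gs://" is not natively resolvable in this PyArrow version, so the
# fsspec route above is the one that applies here:
# table = pq.read_table("gs://myBucket/features/MyParquet.parquet/")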
Can you show the full error traceback from pq.read_table(file_path, filesystem=fs) (with file_path without gs://)?
Callista Rogers:
FileNotFoundError Traceback (most recent call last)
/tmp/ipykernel_1684296/1250703457.py in <module>
3 file_path="MyBucket/path/Name_of_parquet.parquet/"
4 fs=gcsfs.GCSFileSystem()
----> 5 table=pq.read_table(file_path,filesystem=fs)
/opt/conda/lib/python3.7/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit)
1968 ignore_prefixes=ignore_prefixes,
1969 pre_buffer=pre_buffer,
-> 1970 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
1971 )
1972 except ImportError:
/opt/conda/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, **kwargs)
1782 format=parquet_format,
1783 partitioning=partitioning,
-> 1784 ignore_prefixes=ignore_prefixes)
1785
1786 @property
/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
665
666 if _is_path_like(source):
--> 667 return _filesystem_dataset(source, **kwargs)
668 elif isinstance(source, (tuple, list)):
669 if all(_is_path_like(elem) for elem in source):
/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
420 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
421
--> 422 return factory.finish(schema)
423
424
/opt/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.DatasetFactory.finish()
/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/opt/conda/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_open_input_file()
/opt/conda/lib/python3.7/site-packages/pyarrow/fs.py in open_input_file(self, path)
392
393 if not self.fs.isfile(path):
--> 394 raise FileNotFoundError(path)
395
396 return PythonFile(self.fs.open(path, mode="rb"), mode="r")
Joris Van den Bossche / @jorisvandenbossche:
Could you also show the output of:
import gcsfs
file_path = "MyBucket/path/Name_of_parquet.parquet/"
fs = gcsfs.GCSFileSystem()
print(fs.info(file_path))
from pyarrow.fs import PyFileSystem, FSSpecHandler
pa_fs = PyFileSystem(FSSpecHandler(fs))
print(pa_fs.get_file_info(file_path))
Callista Rogers:
From fs.info(file_path):
{'bucket': 'MyBucket', 'name': 'MyBucket/path/Name_of_parquet.parquet/', 'size': 0, 'storageClass': 'DIRECTORY', 'type': 'directory'}
And from pa_fs.get_file_info(file_path):
<FileInfo for 'MyBucket/path/Name_of_parquet.parquet/': type=FileType.Directory>
Joris Van den Bossche / @jorisvandenbossche:
[~crogers923]
thanks for the quick follow-up.
It's strange that it correctly sees a directory, but then the actual reading fails with "FileNotFound" (thinking it is a file, not a directory).
But I remember now that a few weeks ago we had a similar issue on the user mailing list, with gcsfs giving such an error (see my answer at https://lists.apache.org/thread/d0fccn94ovt2hh6cgyktcvz127x5pysw). In that case, it mattered whether you called the "info" method the first or the second time. Can you check that here as well? Is the output you show above what you get when running it the first time (after restarting the interactive console session)?
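(For reference, a minimal sketch of that check, using the placeholder path from this thread:)
import gcsfs

fs = gcsfs.GCSFileSystem()
path = "MyBucket/path/Name_of_parquet.parquet/"

# On a fresh session, the first call may report the zero-size placeholder
# object as a file, while the second (cached) call reports a directory.
print(fs.info(path)["type"])
print(fs.info(path)["type"])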
Callista Rogers:
Oh that's interesting. The first time I run fs.info, I get:
{'kind': 'storage#object', 'id': 'MyBucket/path/name_of_parquet.parquet//1646930508287024', 'selfLink': 'https://www.googleapis.com/storage/v1/b/MyBucket%2Fpath/name_of_parquet.parquet%2F', 'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/MyBucket/o/path%2Fname_of_parquet.parquet%2F?generation=1646930508287024&alt=media', 'name': 'MyBucket/path/name_of_parquet.parquet/', 'bucket': 'MyBucket', 'generation': '1646930508287024', 'metageneration': '1', 'contentType': 'application/octet-stream', 'storageClass': 'STANDARD', 'size': 0, 'md5Hash': '1B2M2Y8AsgTpgAmY7PhCfg==', 'crc32c': 'AAAAAA==', 'etag': 'CLCYqp/+u/YCEAE=', 'timeCreated': '2022-03-10T16:41:48.428Z', 'updated': '2022-03-10T16:41:48.428Z', 'timeStorageClassUpdated': '2022-03-10T16:41:48.428Z', 'type': 'file'}
And from pa_fs.get_file_info(file_path), the first time:
<FileInfo for 'myBucket/features/MyParquet.parquet/': type=FileType.File, size=0>
Joris Van den Bossche / @jorisvandenbossche:
OK, so it's the same bug in gcsfs from the mailing list thread that you are running into: because gcsfs returns something different the first vs subsequent times, pyarrow is confused about whether it is querying a file or directory. I know that gcsfs uses size 0 empty files to mimic directories (since the filesystem itself doesn't have that concept), but I think it can still be expected from gcsfs to return a consistent info for the same file path.
I would open an issue about this on the gcsfs side (I don't think that happened after the previous mailing list thread).
If you first do such an initial "info" call, and then do the read_table call (which should now directly treat the file path as a directory), does it then work correctly?
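(A minimal sketch of the warm-up workaround being suggested here, with the same placeholder names; this assumes the cached directory info fixes the subsequent lookup:)
import gcsfs
import pyarrow.parquet as pq

fs = gcsfs.GCSFileSystem()
file_path = "MyBucket/path/Name_of_parquet.parquet/"

fs.info(file_path)  # initial "info" call, so later lookups report a directory
table = pq.read_table(file_path, filesystem=fs)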
Callista Rogers:
It doesn't work, I still get the FileNotFound error.
I'll open a ticket with gcsfs. Thanks!
Joris Van den Bossche / @jorisvandenbossche:
It doesn't work, I still get the FileNotFound error.
That's strange, I expected that to be a workaround. Does it give the exact same error message and traceback? Can you show the output of calling get_file_info twice (so it shows the file path is a directory the second time), and then passing that path to read_table?
Callista Rogers:
Running it the first time on a fresh kernel, I actually get a shorter traceback. The traceback above matches what I get the second time I run it.
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
/tmp/ipykernel_48702/3635827774.py in <module>
----> 1 table=pq.read_table(file_path,filesystem=fs)
/opt/conda/lib/python3.7/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit)
1968 ignore_prefixes=ignore_prefixes,
1969 pre_buffer=pre_buffer,
-> 1970 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
1971 )
1972 except ImportError:
/opt/conda/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, **kwargs)
1764
1765 self._dataset = ds.FileSystemDataset(
-> 1766 [fragment], schema=fragment.physical_schema,
1767 format=parquet_format,
1768 filesystem=fragment.filesystem
/opt/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()
/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/opt/conda/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_open_input_file()
/opt/conda/lib/python3.7/site-packages/pyarrow/fs.py in open_input_file(self, path)
392
393 if not self.fs.isfile(path):
--> 394 raise FileNotFoundError(path)
395
396 return PythonFile(self.fs.open(path, mode="rb"), mode="r")
I think these are related: starburstdata/dbt-trino#226
Is there a way for pyarrow to ignore the zero-size objects (i.e. the S3/GCS directory placeholder files) in a directory?
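(PyArrow does not expose such an option as far as I know, but one possible workaround sketch, using the placeholder names from this thread, is to list the files yourself, drop the zero-size placeholder entries, and pass the explicit file list:)
import gcsfs
import pyarrow.parquet as pq
from pyarrow.fs import FileSelector, PyFileSystem, FSSpecHandler

fs = gcsfs.GCSFileSystem()
pa_fs = PyFileSystem(FSSpecHandler(fs))

# List everything under the dataset root and keep only real data files,
# skipping the zero-size objects that mimic directories.
infos = pa_fs.get_file_info(
    FileSelector("myBucket/features/MyParquet.parquet/", recursive=True))
paths = [info.path for info in infos if info.is_file and info.size > 0]

table = pq.read_table(paths, filesystem=fs)
Note that passing explicit file paths bypasses hive-partition discovery, so partition columns such as year would need to be handled separately.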