apache/arrow

[Python] pyarrow.parquet.read_table either returns FileNotFound or ArrowInvalid


Running the code below results in: "GetFileInfo() yielded path 'myBucket/features/MyParquet.parquet/year=2022/part-0019.snappy.parquet' which is outside base dir 'gs://myBucket/features/MyParquet.parquet/'"

import pyarrow.parquet as pq
import gcsfs

file_path = "gs://myBucket/features/MyParquet.parquet/"
fs = gcsfs.GCSFileSystem()
table = pq.read_table(file_path, filesystem=fs)

Removing the gs:// from file_path results in a FileNotFoundError. Any variation of / or // at the beginning of the path gives me the 'outside base dir' error.

I also ran the code below and got valid results using both file_path patterns, so I know it finds the path just fine.

from pyarrow.fs import FileSelector, PyFileSystem, FSSpecHandler

filesys = PyFileSystem(FSSpecHandler(fs))
selector = FileSelector(file_path, recursive=True)
filesys.get_file_info(selector)

Environment: GCP JupyterLab notebooks
Reporter: Callista Rogers

Note: This issue was originally created as ARROW-15910. Please see the migration documentation for further details.

Antoine Pitrou / @pitrou:
Can you try with PyArrow 7.0.0?
Also cc @jorisvandenbossche

Callista Rogers:
Hey Antoine,

Same results in both tests with PyArrow 7.0.0 as well.

Joris Van den Bossche / @jorisvandenbossche:
The first error is expected, I think: you can either pass a URI (with "gs://") or a path plus a filesystem object (in which case the path cannot contain "gs://"). And since we do not yet expose GCS filesystem discovery in Python, the only option is to go through the fsspec filesystem, which means you have to pass a file path, not a URI.

Removing the gs:// from file_path results in a FileNotFoundError. Any variation of / or // at the beginning of the path gives me the 'outside base dir' error.

So this would then be expected to work (remove the gs://).
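
For example, a minimal sketch of that expected-to-work call, reusing the reporter's anonymized paths:

import pyarrow.parquet as pq
import gcsfs

# Pass a bucket-relative path (no "gs://" scheme) together with the fsspec
# filesystem object; pyarrow resolves the path against that filesystem.
file_path = "myBucket/features/MyParquet.parquet/"
fs = gcsfs.GCSFileSystem()
table = pq.read_table(file_path, filesystem=fs)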

Can you show the full error traceback from pq.read_table(file_path, filesystem=fs) (with file_path without the gs://)?

Callista Rogers:

FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_1684296/1250703457.py in <module>
      3 file_path="MyBucket/path/Name_of_parquet.parquet/"
      4 fs=gcsfs.GCSFileSystem()
----> 5 table=pq.read_table(file_path,filesystem=fs)

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit)
   1968                 ignore_prefixes=ignore_prefixes,
   1969                 pre_buffer=pre_buffer,
-> 1970                 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
   1971             )
   1972         except ImportError:

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, **kwargs)
   1782                                    format=parquet_format,
   1783                                    partitioning=partitioning,
-> 1784                                    ignore_prefixes=ignore_prefixes)
   1785 
   1786     @property

/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    665 
    666     if _is_path_like(source):
--> 667         return _filesystem_dataset(source, **kwargs)
    668     elif isinstance(source, (tuple, list)):
    669         if all(_is_path_like(elem) for elem in source):

/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    420     factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
    421 
--> 422     return factory.finish(schema)
    423 
    424 

/opt/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.DatasetFactory.finish()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_open_input_file()

/opt/conda/lib/python3.7/site-packages/pyarrow/fs.py in open_input_file(self, path)
    392 
    393         if not self.fs.isfile(path):
--> 394             raise FileNotFoundError(path)
    395 
    396         return PythonFile(self.fs.open(path, mode="rb"), mode="r")

Joris Van den Bossche / @jorisvandenbossche:
Could you also show the output of

import gcsfs
file_path = "MyBucket/path/Name_of_parquet.parquet/"
fs = gcsfs.GCSFileSystem()
print(fs.info(file_path))

from pyarrow.fs import PyFileSystem, FSSpecHandler
pa_fs = PyFileSystem(FSSpecHandler(fs))
print(pa_fs.get_file_info(file_path))

Callista Rogers:
From fs.info(file_path):
{'bucket': 'MyBucket', 'name': 'MyBucket/path/Name_of_parquet.parquet/', 'size': 0, 'storageClass': 'DIRECTORY', 'type': 'directory'}

From pa_fs.get_file_info(file_path):
<FileInfo for 'MyBucket/path/Name_of_parquet.parquet/': type=FileType.Directory>

Joris Van den Bossche / @jorisvandenbossche:
[~crogers923] thanks for the quick follow-up.

It's strange that it correctly sees a directory, but then the actual reading fails with FileNotFoundError (thinking it is a file, not a directory).
But I now remember that a few weeks ago we had a similar issue on the user mailing list with gcsfs giving such an error (see my answer at https://lists.apache.org/thread/d0fccn94ovt2hh6cgyktcvz127x5pysw). In that case it mattered whether you called the "info" method the first or the second time. Can you check that here as well? Is the output you show above what you get when running it the first time (after restarting the interactive console session)?

Callista Rogers:
Oh that's interesting. The first time I run fs.info, I get:

{'kind': 'storage#object', 'id': 'MyBucket/path/name_of_parquet.parquet//1646930508287024', 'selfLink': 'https://www.googleapis.com/storage/v1/b/MyBucket%2Fpath/name_of_parquet.parquet%2F', 'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/MyBucket/o/path%2Fname_of_parquet.parquet%2F?generation=1646930508287024&alt=media', 'name': 'MyBucket/path/name_of_parquet.parquet/', 'bucket': 'MyBucket', 'generation': '1646930508287024', 'metageneration': '1', 'contentType': 'application/octet-stream', 'storageClass': 'STANDARD', 'size': 0, 'md5Hash': '1B2M2Y8AsgTpgAmY7PhCfg==', 'crc32c': 'AAAAAA==', 'etag': 'CLCYqp/+u/YCEAE=', 'timeCreated': '2022-03-10T16:41:48.428Z', 'updated': '2022-03-10T16:41:48.428Z', 'timeStorageClassUpdated': '2022-03-10T16:41:48.428Z', 'type': 'file'}

From pa_fs.get_file_info(file_path) the first time:
<FileInfo for 'myBucket/features/MyParquet.parquet/': type=FileType.File, size=0>

Joris Van den Bossche / @jorisvandenbossche:
OK, so it's the same gcsfs bug from the mailing list thread that you are running into: because gcsfs returns something different the first vs. subsequent times, pyarrow gets confused about whether it is querying a file or a directory. I know that gcsfs uses empty size-0 files to mimic directories (since the object store itself doesn't have that concept), but I think gcsfs can still be expected to return consistent info for the same file path.

I would open an issue about this on the gcsfs side (I don't think that happened after the previous mailing list thread).

If you first do such an initial "info" call, and then do the read_table call (which now should directly think the file path is a directory), does it then work correctly?
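
A minimal sketch of that workaround idea, assuming (per the mailing-list thread) that an initial info() call primes gcsfs's cache so later lookups see a directory:

import pyarrow.parquet as pq
import gcsfs

fs = gcsfs.GCSFileSystem()
file_path = "MyBucket/path/Name_of_parquet.parquet/"

# The first info() call may report the zero-byte placeholder object as a file,
# but it also populates gcsfs's internal cache, so later lookups see a directory.
fs.info(file_path)

# With the cache primed, read_table would then treat the path as a dataset directory.
table = pq.read_table(file_path, filesystem=fs)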

Callista Rogers:
It doesn't work, I still get the FileNotFound error.

I'll open a ticket with gcsfs. Thanks!

Joris Van den Bossche / @jorisvandenbossche:

It doesn't work, I still get the FileNotFound error.

That's strange; I expected that to work around it. Does it give the exact same error message and traceback? Can you show the output of calling get_file_info twice (so that it reports the path as a directory the second time), and then passing that path to read_table?
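
For example, something along these lines (a sketch reusing the PyFileSystem wrapper from earlier in the thread):

from pyarrow.fs import PyFileSystem, FSSpecHandler
import pyarrow.parquet as pq
import gcsfs

fs = gcsfs.GCSFileSystem()
pa_fs = PyFileSystem(FSSpecHandler(fs))
file_path = "MyBucket/path/Name_of_parquet.parquet/"

print(pa_fs.get_file_info(file_path))  # first call: may report FileType.File, size=0
print(pa_fs.get_file_info(file_path))  # second call: should report FileType.Directory

table = pq.read_table(file_path, filesystem=fs)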

Callista Rogers:
Running it the first time on a fresh kernel, I actually get a shorter traceback. The traceback above matches what I get the second time I run it.

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_48702/3635827774.py in <module>
----> 1 table=pq.read_table(file_path,filesystem=fs)

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit)
   1968                 ignore_prefixes=ignore_prefixes,
   1969                 pre_buffer=pre_buffer,
-> 1970                 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
   1971             )
   1972         except ImportError:

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, **kwargs)
   1764 
   1765             self._dataset = ds.FileSystemDataset(
-> 1766                 [fragment], schema=fragment.physical_schema,
   1767                 format=parquet_format,
   1768                 filesystem=fragment.filesystem

/opt/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_open_input_file()

/opt/conda/lib/python3.7/site-packages/pyarrow/fs.py in open_input_file(self, path)
    392 
    393         if not self.fs.isfile(path):
--> 394             raise FileNotFoundError(path)
    395 
    396         return PythonFile(self.fs.open(path, mode="rb"), mode="r")
 

I think these are related -- starburstdata/dbt-trino#226

Is there a way for pyarrow to ignore the zero-byte placeholder objects (i.e. the S3/GCS "directory" marker files) inside a directory?
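
One possible mitigation, sketched here rather than a confirmed pyarrow feature: list the data files yourself with fsspec, drop anything with size 0, and pass the explicit file list so pyarrow never has to stat the marker objects. Note that passing a list of files bypasses hive-partition discovery unless you also supply a partitioning scheme.

import pyarrow.parquet as pq
import gcsfs

fs = gcsfs.GCSFileSystem()

# find(..., detail=True) returns a {path: info} mapping; drop the zero-byte
# "directory" marker objects and keep only real data files.
infos = fs.find("MyBucket/path/Name_of_parquet.parquet", detail=True)
paths = [p for p, info in infos.items() if info["size"] > 0]

table = pq.read_table(paths, filesystem=fs)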