read parquet from s3 failing with 'GeoArrowEngine' has no attribute 'extract_filesystem'
raybellwaves opened this issue · 3 comments
We have nightly tests that read geoparquet files from our s3 buckets (using intake-geopandas). These started failing with the release of dask 2023.4.0 three days ago. cc @jrbourbeau.
I'll try to update this if I can find a geoparquet file hosted on a public s3 bucket.
Create new environment:
mamba create -n test_env python=3.10 -y && conda activate test_env
Install dask-geopandas and s3fs:
pip install dask-geopandas s3fs
Open a (geo)parquet file:
import dask_geopandas as dgpd
dgpd.read_parquet("s3://BUCKET/FILE.parquet")
Traceback (most recent call last):
File "/opt/userenvs/ray.bell/test_env/lib/python3.10/site-packages/dask/backends.py", line 135, in wrapper
return func(*args, **kwargs)
File "/opt/userenvs/ray.bell/test_env/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py", line 519, in read_parquet
fs, paths, dataset_options, open_file_options = engine.extract_filesystem(
AttributeError: type object 'GeoArrowEngine' has no attribute 'extract_filesystem'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/userenvs/ray.bell/test_env/lib/python3.10/site-packages/dask_geopandas/io/parquet.py", line 111, in read_parquet
result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)
File "/opt/userenvs/ray.bell/test_env/lib/python3.10/site-packages/dask/backends.py", line 137, in wrapper
raise type(e)(
AttributeError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: type object 'GeoArrowEngine' has no attribute 'extract_filesystem'
Packages installed:
pip freeze
aiobotocore==2.5.0
aiohttp==3.8.4
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.2
attrs==23.1.0
botocore==1.29.76
certifi==2022.12.7
charset-normalizer==3.1.0
click==8.1.3
click-plugins==1.1.1
cligj==0.7.2
cloudpickle==2.2.1
dask==2023.4.0
dask-geopandas==0.3.0
distributed==2023.4.0
Fiona==1.9.3
frozenlist==1.3.3
fsspec==2023.4.0
geopandas==0.12.2
HeapDict==1.0.1
idna==3.4
importlib-metadata==6.4.1
Jinja2==3.1.2
jmespath==1.0.1
locket==1.0.0
MarkupSafe==2.1.2
msgpack==1.0.5
multidict==6.0.4
munch==2.5.0
numpy==1.24.2
packaging==23.1
pandas==2.0.0
partd==1.4.0
psutil==5.9.5
pyproj==3.5.0
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0
s3fs==2023.4.0
shapely==2.0.1
six==1.16.0
sortedcontainers==2.4.0
tblib==1.7.0
toolz==0.12.0
tornado==6.2
tzdata==2023.3
urllib3==1.26.15
wrapt==1.15.0
yarl==1.8.2
zict==2.2.0
zipp==3.15.0
Thanks @raybellwaves. I wonder if this is a duplicate of #241?
This started failing with the release of dask 2023.4.0 three days ago
I'm not aware of any related changes in this release. The extract_filesystem method in the traceback was added several releases ago (xref dask/dask#9699). Also, as Joris mentioned here #241 (comment), I would expect GeoArrowEngine to have an extract_filesystem method regardless, since it subclasses the arrow parquet engine in dask.
Cross posting from the other thread #241 (comment)
=======
Hi! I was able to look into this. If pyarrow is not installed, the inheritance falls apart because of the fallback import.
dask-geopandas/dask_geopandas/io/parquet.py
Lines 15 to 22 in d3e15d1
I think some environments have pyarrow installed by default, so you really need a clean env to reproduce this. A solution would be to raise an import error/warning when instantiating GeoArrowEngine if pyarrow was not properly imported.
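To illustrate the failure mode, here is a self-contained sketch (not the actual dask-geopandas code; the real import and class names are only paraphrased in comments) showing how a silent fallback import leaves GeoArrowEngine without the methods dask expects:

```python
# Simulate the fallback import pattern in dask_geopandas/io/parquet.py:
# when pyarrow is missing, the real arrow engine cannot be imported
# and the module substitutes a dummy base class instead.
try:
    # In the real module this is roughly:
    #   from dask.dataframe.io.parquet.arrow import ArrowDatasetEngine
    # Here we force the failure path to illustrate the bug.
    raise ImportError("pyarrow is not installed")
except ImportError:
    ArrowDatasetEngine = object  # silent fallback

class GeoArrowEngine(ArrowDatasetEngine):
    """With the fallback base, none of the arrow engine's
    classmethods (extract_filesystem, read_partition, ...) exist."""

# dask's read_parquet later calls engine.extract_filesystem(...),
# which produces the AttributeError in the traceback above:
print(hasattr(GeoArrowEngine, "extract_filesystem"))  # False
```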
To reiterate:
This fails:
pip install dask dask-geopandas
This works:
pip install dask dask-geopandas pyarrow
# or
pip install dask[complete] dask-geopandas
We might want to do something similar to how geopandas checks whether pygeos is installed.