read parquet from s3 failing with 'GeoArrowEngine' has no attribute 'extract_filesystem'
raybellwaves opened this issue · 3 comments
We have nightly tests that read geoparquet files from our s3 buckets (using intake-geopandas). These started failing with the release of dask 2023.4.0 three days ago. cc @jrbourbeau.
I'll try to update this if I can find a geoparquet file hosted on a public s3 bucket.
Create new environment:
mamba create -n test_env python=3.10 -y && conda activate test_env
Install dask-geopandas and s3fs:
pip install dask-geopandas s3fs
Open a (geo)parquet file:
import dask_geopandas as dgpd
dgpd.read_parquet("s3://BUCKET/FILE.parquet")
Traceback (most recent call last):
File "/opt/userenvs/ray.bell/test_env/lib/python3.10/site-packages/dask/backends.py", line 135, in wrapper
return func(*args, **kwargs)
File "/opt/userenvs/ray.bell/test_env/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py", line 519, in read_parquet
fs, paths, dataset_options, open_file_options = engine.extract_filesystem(
AttributeError: type object 'GeoArrowEngine' has no attribute 'extract_filesystem'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/userenvs/ray.bell/test_env/lib/python3.10/site-packages/dask_geopandas/io/parquet.py", line 111, in read_parquet
result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)
File "/opt/userenvs/ray.bell/test_env/lib/python3.10/site-packages/dask/backends.py", line 137, in wrapper
raise type(e)(
AttributeError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: type object 'GeoArrowEngine' has no attribute 'extract_filesystem'
Packages installed:
pip freeze
aiobotocore==2.5.0
aiohttp==3.8.4
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.2
attrs==23.1.0
botocore==1.29.76
certifi==2022.12.7
charset-normalizer==3.1.0
click==8.1.3
click-plugins==1.1.1
cligj==0.7.2
cloudpickle==2.2.1
dask==2023.4.0
dask-geopandas==0.3.0
distributed==2023.4.0
Fiona==1.9.3
frozenlist==1.3.3
fsspec==2023.4.0
geopandas==0.12.2
HeapDict==1.0.1
idna==3.4
importlib-metadata==6.4.1
Jinja2==3.1.2
jmespath==1.0.1
locket==1.0.0
MarkupSafe==2.1.2
msgpack==1.0.5
multidict==6.0.4
munch==2.5.0
numpy==1.24.2
packaging==23.1
pandas==2.0.0
partd==1.4.0
psutil==5.9.5
pyproj==3.5.0
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0
s3fs==2023.4.0
shapely==2.0.1
six==1.16.0
sortedcontainers==2.4.0
tblib==1.7.0
toolz==0.12.0
tornado==6.2
tzdata==2023.3
urllib3==1.26.15
wrapt==1.15.0
yarl==1.8.2
zict==2.2.0
zipp==3.15.0
Thanks @raybellwaves. I wonder if this is a duplicate of #241?
This started failing with the release of dask 2023.4.0 three days ago
I'm not aware of any related changes in this release. The extract_filesystem method in the traceback was added several releases ago (xref dask/dask#9699). Also, as Joris mentioned here #241 (comment), I would expect GeoArrowEngine to have an extract_filesystem method regardless, since it subclasses the arrow parquet engine in dask.
Cross posting from the other thread #241 (comment)
=======
Hi! I was able to look into this. If pyarrow is not installed, the inheritance falls apart because of the fallback import.
dask-geopandas/dask_geopandas/io/parquet.py
Lines 15 to 22 in d3e15d1
I think some environments have pyarrow installed by default, so you really need a clean env to reproduce this. A solution would be to raise an import error/warning when instantiating GeoArrowEngine if pyarrow was not properly imported.
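To illustrate the failure mode, here is a self-contained sketch (not the actual dask-geopandas code; the real import and class names are only paraphrased in comments) showing how a silent fallback import leaves GeoArrowEngine without the methods dask expects:

```python
# Simulate the fallback import pattern in dask_geopandas/io/parquet.py:
# when pyarrow is missing, the real arrow engine cannot be imported
# and the module substitutes a dummy base class instead.
try:
    # In the real module this is roughly:
    #   from dask.dataframe.io.parquet.arrow import ArrowDatasetEngine
    # Here we force the failure path to illustrate the bug.
    raise ImportError("pyarrow is not installed")
except ImportError:
    ArrowDatasetEngine = object  # silent fallback

class GeoArrowEngine(ArrowDatasetEngine):
    """With the fallback base, none of the arrow engine's
    classmethods (extract_filesystem, read_partition, ...) exist."""

# dask's read_parquet later calls engine.extract_filesystem(...),
# which produces the AttributeError in the traceback above:
print(hasattr(GeoArrowEngine, "extract_filesystem"))  # False
```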
To reiterate:
This fails:
pip install dask dask-geopandas
This works:
pip install dask dask-geopandas pyarrow
# or
pip install dask[complete] dask-geopandas
We might want to do something similar to how geopandas checks whether pygeos is installed.