geopandas/dask-geopandas

Mistaken documentation for dask_geopandas.read_parquet

barbuz opened this issue · 2 comments

I am trying to use the read_parquet function in my code, and I've encountered some trouble where the documentation says one thing and the code seems to be doing something different.

These are the two discrepancies I have found, but there could be others:

  • The engine parameter cannot really be set. dask-geopandas uses a custom GeoArrowEngine engine, and trying to specify anything else raises an exception.
  • The documentation says that for split_row_groups "Default is True if a _metadata file is available or if the dataset is composed of a single file (otherwise defult is False).". It looks like the code does not even try to provide a default value and just passes it directly to dask.dataframe.read_parquet, which defaults to False in all cases.

I think the read_parquet docstring is just copied directly over from dask.dataframe.read_parquet at

read_parquet.__doc__ = dd.read_parquet.__doc__
.

Your first point is correct, only the custom engine is allowed in dask-geopandas I think.

For your second point, maybe raise an issue in dask/dask if you think that documentation isn't accurate there too. I know there's been some churn around that parameter recently and it may be out of date.


As for what to do here, I'm not sure. dask-geopandas mostly aligns with dask.dataframe.read_parquet, so it's nice to pick up the changes from there automatically. Dask does include a derived_from decorator that can be used to copy over docstrings, with some control over things like "unused arugments": https://github.com/dask/dask/blob/34a1e88bb3f6196361f398ddab55e59d315d8d40/dask/utils.py#L818. Perhaps that could be used to add a caveat about the engine.

I see, thank you. What confused me is the fact that the dask documentation is different than the dask-geopandas one for split_row_groups. But maybe as you said it could just be out of date.
I've noticed the effects of the derived_from decorator, where some pages (example) have a disclaimer saying that this docstring was copied from somewhere else. Using the decorator for read_parquet and to_parquet should make the same disclaimer appear there, but I believe that would involve rewriting the function definitions so that the arguments used are listed explicitly and not just packed into *args and *kwargs.