geopandas/dask-geopandas

'dask_geopandas.from_dask_dataframe' produces error: 'DataFrame' object has no attribute 'map_partitions'

komzy opened this issue · 1 comments

komzy commented

I'm writing a short script to read a large GeoJSON file (>3 GB) with dask and convert it to a dask-geopandas dataframe, but I run into the error above.

Here's my code:

import pandas as pd
import geopandas as gpd
from shapely.geometry import LineString
import dask_geopandas
import dask.dataframe as dd

dask_df = dd.read_json('madagascar_gen.txt',orient='list').compute()
dgpd = dask_geopandas.from_dask_dataframe(dask_df, geometry="geometry")

Error log:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [18], in <cell line: 1>()
----> 1 dgpd = dask_geopandas.from_dask_dataframe(dask_df, geometry="geometry")

File ~/opt/anaconda3/lib/python3.9/site-packages/dask_geopandas/core.py:790, in from_dask_dataframe(df, geometry)
    786     name = geometry.name if geometry.name is not None else "geometry"
    787     return df.assign(**{name: geometry}).map_partitions(
    788         geopandas.GeoDataFrame, geometry=name
    789     )
--> 790 return df.map_partitions(geopandas.GeoDataFrame, geometry=geometry)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'map_partitions'

madagascar_gen.json:

[
{"geometry":{"coordinates":[44.3207501,-20.290752],"type":"Point"},"type":"Feature","properties":{"oID":"1","timestamp":"2022-09-02 11:05:44"}},
{"geometry":{"coordinates":[44.32089653504225,-20.290709591647275],"type":"Point"},"type":"Feature","properties":{"oID":"1","timestamp":"2022-09-02 11:05:44"}},
{"geometry":{"coordinates":[44.32104297004467,-20.290667183294346],"type":"Point"},"type":"Feature","properties":{"oID":"1","timestamp":"2022-09-02 11:05:44"}},
...
]

Does anyone know why this is happening?

You are not passing a dask.dataframe to dask_geopandas.from_dask_dataframe. Calling compute() executes the task graph and returns a plain pandas DataFrame, which has no map_partitions method. If you want to read with dask.dataframe, drop the compute() call:

dask_df = dd.read_json('madagascar_gen.txt',orient='list')
dgpd = dask_geopandas.from_dask_dataframe(dask_df, geometry="geometry")

However, since the file is GeoJSON, reading it as JSON will not parse the geometry column into shapely geometries, so you would need to create the geometry array yourself. The better option is to read the file directly with dask-geopandas:

dgpd = dask_geopandas.read_file("madagascar_gen.json", npartitions=4)
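
If you do want to build the geometry yourself, one possible approach is to convert the GeoJSON-style dicts with shapely.geometry.shape. The sketch below works on a single pandas chunk built from the first two sample records in the issue; the helper name to_geodataframe is hypothetical, and the assumption is that each row holds a dict in its "geometry" column as in the sample file. Something along these lines could be applied per partition, but that part is not shown here.

```python
import pandas as pd
import geopandas
from shapely.geometry import shape

# First two records from the sample in the issue.
records = [
    {"geometry": {"coordinates": [44.3207501, -20.290752], "type": "Point"},
     "type": "Feature",
     "properties": {"oID": "1", "timestamp": "2022-09-02 11:05:44"}},
    {"geometry": {"coordinates": [44.32089653504225, -20.290709591647275], "type": "Point"},
     "type": "Feature",
     "properties": {"oID": "1", "timestamp": "2022-09-02 11:05:44"}},
]
pdf = pd.DataFrame(records)


def to_geodataframe(part):
    """Hypothetical helper: turn GeoJSON-like dicts into shapely geometries."""
    geometry = geopandas.GeoSeries(
        [shape(g) for g in part["geometry"]], index=part.index
    )
    # Replace the dict column with a real geometry column.
    return geopandas.GeoDataFrame(part.drop(columns="geometry"), geometry=geometry)


gdf = to_geodataframe(pdf)
print(gdf.geometry.geom_type.tolist())
```

No CRS is set in this sketch; you would have to assign one yourself if you need it.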