xarray-contrib/cf-xarray

scalability of geometry encoding


I've been playing with encoding larger datasets using the cf-xarray geometry approach. Here's some code:

import geopandas as gp
import xvec
import xarray as xr

url = (
    "s3://overturemaps-us-west-2/release/2024-08-20.0/theme=buildings/type=building/"
    "part-00000-2ad9544f-1d68-4a5a-805c-7a5d020d084d-c000.zstd.parquet"
)

# ~ 30s
df = gp.read_parquet(url, columns=['id', 'geometry', 'level'])
# 11447790 rows

# all very fast
ds = xr.Dataset(df).set_coords("geometry").swap_dims({"dim_0": "geometry"}).drop_vars("dim_0")
ds = ds.xvec.set_geom_indexes('geometry', crs=df.crs)

# encode only 100_000 rows
%time ds_enc = ds.isel(geometry=slice(0, 100_000)).xvec.encode_cf()
# -> Wall time: 4.25 s

I confirmed that the scaling is roughly linear. At this rate (11,447,790 rows at ~4.25 s per 100,000), encoding the whole dataset would take > 500 seconds. Decoding is about 20x faster.

I'm wondering whether there is some low-hanging fruit here that could speed this up.

Alternatively, we could explore storing geometries as WKB, as GeoParquet does.
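For scale, here is a minimal sketch of what a WKB round-trip looks like with shapely's vectorized functions (variable names are illustrative; how the resulting variable-length bytes would be stored in the encoded dataset is the open question):

import shapely

# serialize each geometry to WKB (an object array of bytes, one blob per geometry)
wkb = shapely.to_wkb(ds["geometry"].values)

# parsing back is a single vectorized call
geoms = shapely.from_wkb(wkb)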

[profiling screenshot]

The whole thing:
[profiling screenshot]


I think we can do better though, primarily by supporting chunked arrays as input. The arrays get passed to GEOS, so we'll need to call map_blocks at the appropriate places.
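As a minimal sketch of that pattern (not the encode path itself, and with a hypothetical chunk size), any per-geometry shapely call maps cleanly over dask blocks; the ragged-array encoding proper is trickier because each block produces variable-length coordinate and offset arrays:

import dask.array as da
import numpy as np
import shapely

# wrap the geometry array so each chunk is handed to GEOS independently
geoms = da.from_array(ds["geometry"].values, chunks=1_000_000)

# any elementwise shapely function maps over blocks without rechunking
n_coords = geoms.map_blocks(shapely.get_num_coordinates, dtype=np.int64)
n_coords.compute()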

Everything else is up to shapely.to_ragged_array. That function spends roughly 10% of its time on this kind of error checking:

   250         1        282.0    282.0      0.0      if include_z is None:
   251         2     725395.0 362697.5      0.0          include_z = np.any(
   252         1  749690266.0    7e+08      5.5              get_coordinate_dimension(geometries[~is_empty(geometries)]) == 3
   253                                                   )
   254                                           
   255         1  552494652.0    6e+08      4.0      geom_types = np.unique(get_type_id(geometries))
   256                                               # ignore missing values (type of -1)
   257         1      21755.0  21755.0      0.0      geom_types = geom_types[geom_types >= 0]
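For reference, the include_z scan above is skipped entirely when the caller passes it explicitly. A minimal sketch, assuming the geometries are known to be 2-D (geoms stands in for the geometry array):

import shapely

# passing include_z explicitly avoids the ~5.5% per-geometry Z-dimension scan
geom_type, coords, offsets = shapely.to_ragged_array(geoms, include_z=False)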

@martinfleis a couple of questions:

  1. Is it possible for us to set include_z based on what GeometryIndex knows?
  2. np.unique can be slow for large arrays (it sorts first). Perhaps a get_unique_type_id at the C layer would be nice; see the sketch after this list.
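To illustrate the point in question 2, here is a hedged sketch of an O(n) counting pass over the type ids (which are small non-negative integers, with -1 for missing), versus the O(n log n) sort inside np.unique:

import numpy as np
import shapely

type_ids = shapely.get_type_id(geoms)  # -1 for missing, else small non-negative ints
valid = type_ids[type_ids >= 0]        # ignore missing values up front
# a counting pass is O(n); np.unique sorts first, which is O(n log n)
geom_types = np.flatnonzero(np.bincount(valid))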

Is it possible for us to set include_z based on what GeometryIndex knows?

Not without duplicating the line of code shown above in our own code. GeometryIndex has no knowledge of this; to determine the presence of Z, we would need to check all geometries, exactly as shapely does.

np.unique can be slow for large arrays (it sorts first). Perhaps a get_unique_type_id at C-layer would be nice.

Could you open a feature request in shapely? That would need to be implemented there by someone who speaks C.

Could you open a feature request in shapely?

OK, let me do some more thorough profiling and I'll open an issue there.