developmentseed/tipg

ENH: output format geoparquet

jsignell opened this issue · 9 comments

This came up in the STAC meeting today. Currently it looks like the supported output formats (available with the f query param) are: 'geojson', 'html', 'json', 'csv', 'geojsonseq', 'ndjson'. I got that list by naively trying https://firenrt.delta-backend.com/collections/public.eis_fire_lf_perimeter_nrt/items?f=geoparquet

It would be neat to add 'geoparquet' as an option.

Not sure if this is the right place to capture the request so feel free to close, just wanted to increase the visibility of that conversation.

thanks for starting the discussion @jsignell 🙏

This is definitely something we could/should support.

Which endpoints should have GeoParquet? items here?

tipg/tipg/factory.py

Lines 1065 to 1067 in be04e60

output_type: Annotated[
    Optional[MediaType], Depends(ItemsOutputType)
] = None,

It might be easiest to use the existing https://github.com/stac-utils/stac-geoparquet library for this

yeah exactly! I think /items under just a different f query param. stac-geoparquet will definitely be helpful, but might need some alterations since these aren't STAC objects exactly.

@kylebarron yes, I think it makes sense to enable GeoParquet output for Items first (we might want collections later, but that will be less useful).

> It might be easiest to use the existing https://github.com/stac-utils/stac-geoparquet library for this

stac-geoparquet depends on pandas and geopandas (and thus shapely); these would be quite heavy dependencies just to add an output format. I was hoping for a more lightweight solution 🙏

Unfortunately GeoParquet currently requires rather heavy dependencies to read and write from Python.

For one, the primary way to read and write Parquet in Python is via pyarrow, and that's an 80MB wheel on top of Numpy:

pip install pyarrow -t pyarrow_tmp
du -csh pyarrow_tmp/*
 12K	pyarrow_tmp/bin
 60M	pyarrow_tmp/numpy
220K	pyarrow_tmp/numpy-1.25.2.dist-info
 85M	pyarrow_tmp/pyarrow
204K	pyarrow_tmp/pyarrow-12.0.1.dist-info
146M	total

Additionally, the GeoParquet spec says to store geometries in WKB, so you need some way to convert your existing geometries into WKB, and Shapely seems like the easiest to reach for.
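For simple 2D geometries, WKB can also be produced with the standard library alone. The following is only a minimal sketch of the encoding, not what Shapely does internally: the function name `geojson_to_wkb` is hypothetical, it handles only Point, LineString, and Polygon, and it silently drops any Z values.

```python
import struct

# Geometry type codes defined by the WKB encoding (OGC Simple Features).
WKB_POINT, WKB_LINESTRING, WKB_POLYGON = 1, 2, 3


def _header(geom_type: int) -> bytes:
    # Byte-order flag (1 = little-endian) followed by a uint32 geometry type.
    return struct.pack("<BI", 1, geom_type)


def _coords(points) -> bytes:
    # uint32 point count, then each point as two little-endian doubles (x, y).
    out = struct.pack("<I", len(points))
    for x, y, *_ in points:  # any Z value is ignored in this sketch
        out += struct.pack("<dd", x, y)
    return out


def geojson_to_wkb(geom: dict) -> bytes:
    """Encode a 2D GeoJSON Point, LineString, or Polygon as WKB (hypothetical helper)."""
    kind, coords = geom["type"], geom["coordinates"]
    if kind == "Point":
        x, y = coords[:2]
        return _header(WKB_POINT) + struct.pack("<dd", x, y)
    if kind == "LineString":
        return _header(WKB_LINESTRING) + _coords(coords)
    if kind == "Polygon":
        # uint32 ring count, then each ring as a counted list of points.
        rings = b"".join(_coords(ring) for ring in coords)
        return _header(WKB_POLYGON) + struct.pack("<I", len(coords)) + rings
    raise ValueError(f"unsupported geometry type in this sketch: {kind}")


point_wkb = geojson_to_wkb({"type": "Point", "coordinates": [1.0, 2.0]})
print(point_wkb.hex())
```

Multi-part geometries, Z/M dimensions, and validation are where a library such as Shapely earns its place, so a hand-rolled encoder like this only makes sense if the inputs are known to be simple.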

People have been discussing making pyarrow more modular so that the bundle size is smaller, but nothing has happened yet. When my Rust geoarrow library and its Python bindings are more stable (not imminently), it might be a good choice for projects like this that need to be deployable on Lambda.

🤯 I don't think this feature is urgently needed right now, so we can wait, especially if that gives your library time to be ready :-)

FYI: we already have pyproj dependency (via morecantile)

Note: we could still add a heavy dependency and make the whole thing optional if this is really something users/customers want

Definitely agree with making it an optional dependency if we add it.
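One way the optional-dependency idea could look, as a rough sketch only: the format is advertised when pyarrow is importable and rejected with a clear message otherwise. The function names and the `tipg[geoparquet]` extra are assumptions made for illustration, not existing tipg API.

```python
from importlib.util import find_spec

# Formats listed in the issue description as currently supported.
BASE_FORMATS = ["geojson", "html", "json", "csv", "geojsonseq", "ndjson"]


def geoparquet_available() -> bool:
    # True only if pyarrow can be imported; nothing is imported eagerly here.
    return find_spec("pyarrow") is not None


def available_formats() -> list[str]:
    # Advertise "geoparquet" only when the optional dependency is installed.
    formats = list(BASE_FORMATS)
    if geoparquet_available():
        formats.append("geoparquet")
    return formats


def resolve_format(f: str) -> str:
    """Validate the value of the `f` query parameter (hypothetical helper)."""
    if f == "geoparquet" and not geoparquet_available():
        # The extra name below is an assumption, not an existing install target.
        raise ValueError("geoparquet output requires pyarrow, e.g. pip install 'tipg[geoparquet]'")
    if f not in available_formats():
        raise ValueError(f"unsupported output format: {f}")
    return f


print(available_formats())
```

Checking with `find_spec` keeps the heavy import out of the startup path, so deployments that never request GeoParquet pay nothing for it.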

bitner commented

Yes, we would definitely want to add this as an optional dependency. We would probably want to implement this starting from a query like SELECT column_a, column_b, ST_AsBinary(geometry_column) FROM mytable, rather than going through the GeoJSON that we create and building the GeoParquet from those results. That would eliminate the need for shapely or the like.
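A rough sketch of that database-first approach, under stated assumptions: PostGIS returns WKB through `ST_AsBinary`, and the GeoParquet 1.0.0 spec requires a JSON document under the `geo` key of the Parquet file metadata. The helper names are hypothetical, and the actual Parquet writing (for example with pyarrow) is deliberately left out.

```python
import json


def build_items_query(table: str, columns: list[str], geometry_column: str) -> str:
    # NOTE: identifiers are interpolated directly for brevity; real code should
    # quote or validate them rather than trust the caller.
    select = ", ".join(
        columns + [f"ST_AsBinary({geometry_column}) AS {geometry_column}"]
    )
    return f"SELECT {select} FROM {table}"


def geoparquet_metadata(geometry_column: str) -> str:
    # Minimal required fields of the GeoParquet 1.0.0 "geo" metadata.
    # An empty geometry_types list means the types are unknown or mixed.
    return json.dumps(
        {
            "version": "1.0.0",
            "primary_column": geometry_column,
            "columns": {
                geometry_column: {"encoding": "WKB", "geometry_types": []}
            },
        }
    )


print(build_items_query("mytable", ["column_a", "column_b"], "geometry_column"))
print(geoparquet_metadata("geometry_column"))
```

With the geometry already arriving as WKB bytes, the remaining work is to write the rows as a Parquet table and attach the JSON string as the `geo` entry in the file-level metadata.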

I'm not too familiar with the tipg internals but happy to help implement this