ENH: output format geoparquet
jsignell opened this issue · 9 comments
This came up in the STAC meeting today. Currently it looks like the supported output formats (available with the f query param) are: 'geojson', 'html', 'json', 'csv', 'geojsonseq', 'ndjson'. I got that list by naively trying https://firenrt.delta-backend.com/collections/public.eis_fire_lf_perimeter_nrt/items?f=geoparquet
It would be neat to add 'geoparquet' as an option.
Not sure if this is the right place to capture the request so feel free to close, just wanted to increase the visibility of that conversation.
thanks for starting the discussion @jsignell 🙏
This is definitely something we could/should support.
Which endpoints should have GeoParquet? items, here (lines 1065 to 1067 in be04e60)?
It might be easiest to use the existing https://github.com/stac-utils/stac-geoparquet library for this
yeah exactly! I think /items, just under a different f query param. stac-geoparquet will definitely be helpful, but might need some alterations since these aren't STAC objects exactly.
@kylebarron yes, I think it makes sense to enable GeoParquet output for Items first (we might want collections later, but that will be less useful).
It might be easiest to use the existing https://github.com/stac-utils/stac-geoparquet library for this
stac-geoparquet depends on pandas and geopandas (and thus shapely), which would be quite heavy dependencies just to add an output format. I was hoping for a more lightweight solution 🙏
Unfortunately GeoParquet currently requires rather heavy dependencies to read and write from Python.
For one, the primary way to read and write Parquet in Python is via pyarrow, and that's an 80MB wheel on top of NumPy:
pip install pyarrow -t pyarrow_tmp
du -csh pyarrow_tmp/*
12K pyarrow_tmp/bin
60M pyarrow_tmp/numpy
220K pyarrow_tmp/numpy-1.25.2.dist-info
85M pyarrow_tmp/pyarrow
204K pyarrow_tmp/pyarrow-12.0.1.dist-info
146M total
Additionally, the GeoParquet spec says to store geometries in WKB, so you need some way to convert your existing geometries into WKB, and Shapely seems like the easiest to reach for.
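To make the WKB requirement concrete, here is a minimal sketch of encoding a single 2D point as little-endian WKB using only the standard library. The helper name point_to_wkb is hypothetical and this covers points only; handling every geometry type by hand is exactly the work a library like Shapely (or the database) would otherwise do.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # Layout: 1-byte byte-order flag (1 = little-endian),
    # uint32 geometry type (1 = Point), then x and y as float64.
    return struct.pack("<BIdd", 1, 1, x, y)


wkb = point_to_wkb(-105.0, 40.0)
print(len(wkb))   # 1 + 4 + 8 + 8 = 21 bytes
print(wkb.hex())
```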
People have been discussing making pyarrow more modular so that the bundle size is smaller, but nothing has happened yet. When my Rust geoarrow library and its Python bindings are more stable (not imminently), they might be a good choice for things like this that are meant to be deployable on Lambda.
🤯 I don't think this feature is urgently needed right now, so we can wait, especially if that gives your library time to get ready :-)
FYI: we already have pyproj dependency (via morecantile)
Note: we could still add a heavy dependency and make the whole thing optional if this is really something users/customers want
Definitely agree with making it an optional dependency if we add it.
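One way the optional-dependency idea could look, as a rough sketch rather than tipg's actual code: only advertise the geoparquet format when pyarrow is importable. The names BASE_FORMATS and available_formats are hypothetical; the base list is the one from the top of this issue.

```python
from importlib.util import find_spec

# Formats listed at the top of this issue.
BASE_FORMATS = ["geojson", "html", "json", "csv", "geojsonseq", "ndjson"]


def available_formats() -> list[str]:
    """Return the output formats, adding geoparquet only if pyarrow is installed."""
    formats = list(BASE_FORMATS)
    # find_spec checks for the package without importing it,
    # so the heavy dependency is never loaded unless it is present and used.
    if find_spec("pyarrow") is not None:
        formats.append("geoparquet")
    return formats


print(available_formats())
```

With this shape, a deployment without pyarrow keeps its current size and behavior, and a request for f=geoparquet can be rejected as an unsupported format.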
Yes, we would definitely want to add this as an optional dependency. We would probably want to implement it starting from a query like SELECT column_a, column_b, ST_AsBinary(geometry_column) FROM mytable and build the GeoParquet directly from those results, rather than going through the GeoJSON that we already create. That would eliminate the need for shapely or the like.
I'm not too familiar with the tipg internals but happy to help implement this
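A rough sketch of that approach, assuming the rows come back from the database with the geometry already serialized as WKB bytes (on PostGIS the function for that is ST_AsBinary). The helper names geo_metadata and rows_to_geoparquet are hypothetical and nothing here reflects tipg internals; pyarrow is imported lazily so it stays optional.

```python
import json


def geo_metadata(geometry_column: str = "geometry") -> dict:
    """Build the file-level "geo" metadata that GeoParquet 1.0 requires."""
    return {
        "version": "1.0.0",
        "primary_column": geometry_column,
        "columns": {
            # An empty geometry_types list means the types are not known.
            # No "crs" key is set, which the spec treats as OGC:CRS84.
            geometry_column: {"encoding": "WKB", "geometry_types": []},
        },
    }


def rows_to_geoparquet(rows, column_names, geometry_column="geometry") -> bytes:
    """Write rows (tuples, geometry as WKB bytes) to an in-memory GeoParquet file."""
    # Imported here so pyarrow is only needed when this format is requested.
    import pyarrow as pa
    import pyarrow.parquet as pq

    columns = {
        name: [row[i] for row in rows] for i, name in enumerate(column_names)
    }
    # Store the WKB values as a plain binary column.
    columns[geometry_column] = pa.array(columns[geometry_column], type=pa.binary())
    table = pa.table(columns)
    table = table.replace_schema_metadata(
        {"geo": json.dumps(geo_metadata(geometry_column))}
    )
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)
    return sink.getvalue().to_pybytes()
```

Since the database does the WKB conversion, the only heavy dependency left on the Python side is pyarrow itself.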