duckdblabs/duckplyr

Support for geo data


This looks great! One feature request I have in mind is support for spatial data, either via new functions/functionality or via documentation if it already works out of the box. See this by @cboettig for inspiration: https://github.com/cboettig/duckdbfs#spatial-data

Another potential source of inspiration is sf's support for tidy operations. It's great how summarise() and other tidy verbs 'just work' on sf objects: https://r-spatial.github.io/sf/reference/tidyverse.html

Thanks for raising this, Robin. Integration with the duckdb spatial extension would be a really cool feature, but also a lot of work.

Do we need to figure out how to translate sf data frames into something that the duckdb spatial extension understands, and vice versa?

Adding support for functions is then "only" a matter of diligence: https://github.com/duckdblabs/duckplyr/pull/179/files#diff-a202cfba76540d6822868ac7755edd4945b6344057d78e0092f4836e33c0d4eaR11 .

> Do we need to figure out how to translate sf data frames into something that the duckdb spatial extension understands, and vice versa?

I imagine so, and given that everything other than the geometry column is already sorted, it's just the geometry that needs converting (safe to assume just 1 geometry column in 99% of use cases I think).

Seems like DuckDB -> sf has been implemented here: https://github.com/cboettig/duckdbfs/blob/main/R/to_sf.R

Not sure how hard the other way would be, let alone how to make it fast.

The duckdb -> sf conversion there is mostly solid, but could be a bit better. Currently there are a couple of different ways in which geospatial data ends up stored in duckdb:

  • If duckdb reads in a vector format file (shapefile, geodatabase, anything BUT geoparquet), it parses with gdal and converts to duckdb's internal geometry. This is the use-case that the above handles. (Though I think the column name for the geometry is inherited from the file, e.g. it might not be called geometry, so really we need to handle this.)

  • If duckdb reads in geoparquet, it does not use the gdal parser (because duckdb's native parquet parser is so much faster!). However, this also means (at least currently) that the column is read in as a binary blob and not the native geometry, so we need an extra coercion. I've been meaning to add this, though it might eventually be solved upstream, see duckdb/duckdb_spatial#299 (comment)
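To make the "binary blob" point concrete: GeoParquet's default geometry encoding is WKB, so what arrives in that column is a byte string with a small header followed by coordinates. Below is a minimal, stdlib-only Python sketch of that byte layout for the simplest case, a 2D point. The helper name `decode_wkb_point` is hypothetical and only for illustration; it is not part of duckdb, duckdbfs, or sf. Inside duckdb itself, the coercion would presumably be done by the spatial extension (e.g. its `ST_GeomFromWKB` function) rather than by hand.

```python
import struct


def decode_wkb_point(blob: bytes) -> tuple:
    """Decode a 2D WKB Point into an (x, y) tuple of floats.

    Hypothetical helper for illustration only; real geometries
    (lines, polygons, Z/M coordinates, SRIDs) need a full WKB parser.
    """
    if len(blob) != 21:
        raise ValueError("a 2D WKB Point is exactly 21 bytes")
    # Byte 0 is the byte-order flag: 0 = big-endian, 1 = little-endian.
    if blob[0] not in (0, 1):
        raise ValueError("invalid WKB byte-order flag")
    endian = "<" if blob[0] == 1 else ">"
    # Bytes 1-4 hold the geometry type code; 1 means Point.
    (geom_type,) = struct.unpack(endian + "I", blob[1:5])
    if geom_type != 1:
        raise ValueError("not a 2D WKB Point (type code %d)" % geom_type)
    # Bytes 5-20 hold x and y as two IEEE 754 doubles.
    x, y = struct.unpack(endian + "dd", blob[5:21])
    return (x, y)


# Little-endian WKB for POINT (1 2), as it might sit in a BLOB column.
blob = bytes.fromhex("0101000000000000000000f03f0000000000000040")
print(decode_wkb_point(blob))  # prints (1.0, 2.0)
```

The point of the sketch is only that a WKB column is self-describing bytes: nothing in the blob's column type tells duckdb (or R) that it is a geometry, which is why the extra coercion step is needed until this is handled upstream.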

Re sf -> duckdb, I don't think this is much of an issue, though there are various ways to do it depending on precisely what you mean by "to duckdb". I think the best thing to do is simply have sf write the data out to disk as a geoparquet file (this assumes sf is built with a recent gdal that has arrow support, of course!). Since this use case presumably means the data is small enough to fit in RAM, writing it out as, say, a geodatabase is probably just as good (maybe better, given the geoparquet issue noted above), and then duckdb can read that in. It is also possible to write to duckdb's native database format with DBI instead (i.e. with a WKB binary column), in which case you'd need the extra coercion once in duckdb to turn it into duckdb's internal spatial type, but I don't see the use for that. (For most users I think it's actually better to pretend that duckdb's native database doesn't exist and work directly against flat files.)

Sorry, long story short, I think duckdbfs should handle both cases (simply noting that sf should serialize to disk in any standard spatial format), modulo this edge case about geoparquet.