geoarrow/geoarrow-r

`st_collect()`, `st_as_sf()`, and default conversion from Arrow to R

Opened this issue · 8 comments

Right now, geoarrow doesn't convert to sf by default and instead maintains a zero-copy shell around the ChunkedArray it came from. This is instantaneous and is kind of like ALTREP for geometry, since we can't do ALTREP on lists the way Arrow does for character, integer, and factor. It's up to 10x faster than an eager conversion and avoids a full copy of the geometry column. I also rather like that it maintains neutrality between terra, sf, vapour, wk, or others that may come along in the future...who are we to guess where the user wants to put the geometry column next? The destination could be Arrow itself (e.g., via group_by() %>% write_dataset()), or the column could get dropped, filtered, or rearranged before calling an sf method.
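To make the current default concrete, here is a sketch of what collecting looks like today (the file and column names are illustrative, and the exact class of the shell may differ):

library(dplyr)
df <- arrow::open_dataset("nc.parquet") %>%
  select(NAME, geometry) %>%
  collect()
class(df$geometry)
# => a geoarrow vctr wrapping the Arrow memory, not an sfc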

However, 99% of the time a user just wants an sf object. After #20 we can use sf::st_as_sf() on an arrow_dplyr_query to collect() it into an sf object, and @boshek suggested st_collect(), which is a way better name and more explicit than st_as_sf(). There are also st_geometry(), st_crs(), st_bbox(), and st_as_sfc() methods for the geoarrow_vctr column; however, we still get an awkward error if we collect() and then try to convert to sf:

vctr <- geoarrow::geoarrow(wk::wkt("POINT (0 1)", crs = "OGC:CRS84"))
df <- data.frame(geometry = vctr)
sf::st_as_sf(df)
#> Error in st_sf(x, ..., agr = agr, sf_column_name = sf_column_name): no simple features geometry column present
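For contrast, the conversion that #20 does enable looks something like this (a sketch; the file name and filter column are illustrative):

library(dplyr)
nc_sf <- arrow::open_dataset("nc.parquet") %>%
  filter(AREA > 0.2) %>%
  sf::st_as_sf()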

The error might be solvable in sf, although I'd like to give the current implementation a chance to be tested and to collect feedback on whether this actually is a problem for anybody before committing to zero-copy-shell-by-default.

Two thoughts come to mind:

  1. I can now very much see the reasons for not having collect() return an sf object. From just an API perspective, I think it probably messes with a user's expectations of what they are getting, particularly given the normal arrow context. Wherever this is ultimately defined, I also think st_collect() mirrors dplyr-like functions (e.g., st_filter() and st_join()) and thus fits that API idiom.
  2. As a curiosity, we previously did implement collect() in {bcdata} to return sf objects. bcdata queries a web feature service, so the idea of not returning something with spatial information never actually came up. E.g.:
library(bcdata)
airports <- bcdc_query_geodata("bc-airports")

ap <- airports %>% 
    filter(LOCALITY == "Vancouver") %>% 
    select(AIRPORT_NAME) %>% 
    collect()
class(ap)
[1] "bcdc_sf"    "sf"         "tbl_df"     "tbl"        "data.frame"

That's clever! I'd never looked into the internals of bcdata but always wished there was an nsdata I could use to make lake maps. I'll definitely be using that for blog posts/reprexes!

The arrow_dplyr_query object for which the dplyr methods are implemented in Arrow can't know about the sticky geometry column, although one could make a wrapper class that does just that ("sf_arrow_dplyr_query"?). Then, if somebody does something like open_dataset_sf(), the method wrappers could apply the sticky-geometry-column rules lazily. That sounds like a lot of work to me but would be nice.

I suppose the underlying tension is the (maybe just my?) desire for geoarrow to provide a flexible and lightning-fast developer-facing interface but also provide users with some convenience functions.

I hope you don't mind me chiming in here. Just thought I would add a voice for collect() over st_collect() as that's the pattern people are used to for executing a query and returning the result. In my mind it's not associated with the object type but with the process. I would guess that people working with a geoparquet dataset would expect collect() to deal appropriately with the geometry column - that's why we went with collect() in {bcdata}. As you say, writing a collect() method would require a class to be applied to the dataset/query. So the pattern would be:

query <- open_dataset_sf("nc.parquet") %>%
  filter(grepl("^A", NAME)) %>%
  select(NAME, geometry) %>%
  collect() # or st_as_sf()

vs

query <- open_dataset("nc.parquet") %>%
  filter(grepl("^A", NAME)) %>%
  select(NAME, geometry) %>%
  st_collect() # or st_as_sf()

In the end it probably doesn't really matter... but I kind of like defining the expected output upfront - and if someone is not using sf but wants to keep it as a zero-copy shell they can just use open_dataset(). I also wonder if classing the object when you open the dataset would make it cleaner to implement other features in the future (e.g., spatial predicates to use inside filter() calls).

Actually one other thought occurred to me... you could define open_dataset_geo(), which just applies a wrapper class ("geoarrow_dplyr_query") marking it as a geoarrow dataset while staying agnostic about the eventual destination. Then you could write a collect.geoarrow_dplyr_query method with an as argument for the desired return type, and/or write methods for the common return types (as you've already done with st_as_sf()). Something like the sketch below.
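A rough sketch of that idea - open_dataset_geo() and geoarrow_dplyr_query are hypothetical names, and the dplyr verbs would also need wrappers to keep the class sticky through filter()/select():

library(dplyr)

open_dataset_geo <- function(sources, ...) {
  query <- arrow::open_dataset(sources, ...)
  # tag the dataset so collect() can dispatch on it later
  class(query) <- c("geoarrow_dplyr_query", class(query))
  query
}

collect.geoarrow_dplyr_query <- function(x, as = c("geoarrow", "sf"), ...) {
  as <- match.arg(as)
  # drop the tag and run arrow's regular collect(); the geometry column
  # stays a zero-copy shell unless an explicit conversion is requested
  class(x) <- setdiff(class(x), "geoarrow_dplyr_query")
  result <- dplyr::collect(x, ...)
  if (as == "sf") result <- sf::st_as_sf(result)
  result
}

The same class could later host the spatial predicates inside filter() mentioned above.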

Love the chiming in!

I really like the geo-specific dataset opener concept...much easier for an sf user to follow because we'd get the geometry column handling that's used there. It looks like we'd be able to draw from bcdata's code for this.

That said, it's also a lot of work to implement a dplyr backend! My personal development priority right now is at a very low level, but I'm happy to prototype something if there are others willing to take it forward. This could probably be generalized to wrap any dplyr backend with a geometry column, too (e.g., PostGIS, an OGR layer via SQL, a tibble with a non-sfc geometry column), although at that point there starts to be some significant overlap with sf.

Keep the ideas coming and let me know what I can do to move things forward!

Yeah, I definitely wondered about alignment with any other work/conversations for a dplyr backend. There is this, which I somehow missed before now: https://github.com/hypertidy/lazysf.

It makes sense that the low-level stuff is the priority right now. Unfortunately for the time being I probably don't have much time to actually contribute meaningfully, but the conversations are fun and good to record for the future!

collect() shouldn't return sf; it should always be closest to native, with explicit conversions. lazysf is pretty rubbish, but that distinction from st_as_sf() was pretty clear.

why not a column of smart pointers to GDAL features? In RODBC days a data frame couldn't even print a raw column without erroring; it was stifling ... I just wanted to read from Manifold and push stuff around ... then {blob} came along, and lazy_tbl, and now there's even {pool} for keeping pointers alive!

so much is possible, this is exciting!

collect() shouldn't return sf; it should always be closest to native, with explicit conversions

Agreed...I think we may also be able to get some sf-like behaviour by implementing methods like st_geometry() so that you can pass geoarrow.XXX vectors directly.
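For instance, a minimal sketch of such a method, assuming the column class is "geoarrow_vctr" (as used above) and that it already supports wk's handler API - both assumptions, not the package's actual implementation:

st_geometry.geoarrow_vctr <- function(obj, ...) {
  # method for the sf::st_geometry() generic; materializes an sfc only
  # when explicitly asked (CRS propagation omitted for brevity)
  wk::wk_handle(obj, wk::sfc_writer())
}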

why not column of smart pointers to GDAL features?

One of the cool parts about GDAL's RFC86 is that for some drivers a GDAL feature is never instantiated (e.g., for the gpkg driver it just copies the blob directly from sqlite), which is what makes it so much faster. From the R side, we really want to avoid a list() of anything because it makes the garbage collector go haywire after about a million features. The "geoarrow.XXX" classes only ever keep one external pointer alive at any given time (the pointer to the Arrow array), and under the hood they are just 1:length(x) with attributes. If you want to test how slow/fast this is, time the garbage collector using wk::xy() and geos::geos_geometry() with a few million features in memory!
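A rough version of that timing experiment (assuming geos::as_geos_geometry() accepts a wk xy vector; exact timings will vary):

n <- 1e6
pts_flat <- wk::xy(runif(n), runif(n))       # coordinates live in plain double vectors
system.time(gc())                            # fast: nothing for the GC to chase

pts_ptrs <- geos::as_geos_geometry(pts_flat) # one external pointer per feature
system.time(gc())                            # much slower: the GC walks a million pointers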

now there's even {pool} for keeping pointers alive!

That's awesome...I will take a closer look!