[Feature] Dissolve Boundaries
JosiahParry opened this issue · 13 comments
A common GIS task is to dissolve boundaries based on shared boundaries of polygons. Doing this with R always bends my head a little bit.
I think having a utility function in {sf} to do this would be very handy.
Here is a minimal example of how that function might work. Using spdep
is probably the best for this task because it already has functions for identifying contiguous polygons as well as the connected subgraphs.
This is inspired by @CGMossa's PhD work
dissolve_boundaries <- function(x) {
if (!requireNamespace("spdep")) {
stop("spdep is needed to use this function")
}
crs <- sf::st_crs(x)
nbs <- spdep::poly2nb(x)
nb_ids <- attr(nbs, "ncomp")[["comp.id"]]
res <- lapply(
split(x, nb_ids),
sf::st_union,
is_coverage = TRUE
)
sf::st_sfc(
unlist(res, recursive = FALSE),
crs = crs
) |> sf::st_combine()
}
It took me a few minutes to understand what exactly this function would do in comparison to st_cast(st_union(x), "POLYGON")
so I made a little reprex to show it.
library(sf)
nc = read_sf(system.file("shape/nc.shp", package="sf"))
set.seed(241)
test = nc[sample(nrow(nc), 18),] |> st_geometry()
a = test |> st_union() |> st_as_sf()
a$id = 1:nrow(a)
b = a |> st_cast("POLYGON") |> st_as_sf()
#> Warning in st_cast.sf(a, "POLYGON"): repeating attributes for all
#> sub-geometries for which they may not be constant
b$id = 1:nrow(b)
c = test |> dissolve_boundaries() |> st_as_sf()
#> Loading required namespace: spdep
c$id = 1:nrow(c)
b
#> Simple feature collection with 11 features and 1 field
#> Geometry type: POLYGON
#> Dimension: XY
#> Bounding box: xmin: -83.73952 ymin: 34.30505 xmax: -75.77316 ymax: 36.55716
#> Geodetic CRS: NAD27
#> First 10 features:
#> id x
#> 1 1 POLYGON ((-83.1615 35.05922...
#> 1.1 2 POLYGON ((-76.00897 36.3196...
#> 1.2 3 POLYGON ((-75.97629 36.5179...
#> 1.3 4 POLYGON ((-75.78317 36.2251...
#> 1.4 5 POLYGON ((-82.74389 35.4180...
#> 1.5 6 POLYGON ((-78.95108 36.2338...
#> 1.6 7 POLYGON ((-76.70538 35.4119...
#> 1.7 8 POLYGON ((-76.6949 35.35043...
#> 1.8 9 POLYGON ((-81.659 36.11759,...
#> 1.9 10 POLYGON ((-80.45065 35.7648...
nrow(b)
#> [1] 11
c
#> Simple feature collection with 8 features and 1 field
#> Geometry type: GEOMETRY
#> Dimension: XY
#> Bounding box: xmin: -83.73952 ymin: 34.30505 xmax: -75.77316 ymax: 36.55716
#> Geodetic CRS: NAD27
#> x id
#> 1 MULTIPOLYGON (((-75.97629 3... 1
#> 2 POLYGON ((-77.98668 34.3399... 2
#> 3 POLYGON ((-78.95108 36.2338... 3
#> 4 MULTIPOLYGON (((-76.70538 3... 4
#> 5 POLYGON ((-82.24016 35.4681... 5
#> 6 POLYGON ((-81.659 36.11759,... 6
#> 7 POLYGON ((-80.50825 36.0708... 7
#> 8 POLYGON ((-83.1615 35.05922... 8
nrow(c)
#> [1] 8
cols = paletteer::paletteer_d("tvthemes::gravityFalls")
par(mfrow = c(2,2), mar = c(0,0,1,0))
plot(test, main = "Original geoms")
plot(a, main = "st_union()",
key.pos = NULL, reset = FALSE, pal = cols)
plot(b, main = "st_union |> st_cast('POLYGON')",
key.pos = NULL, reset = FALSE, pal = cols)
plot(c, main = "dissolve_boundaries()",
key.pos = NULL, reset = FALSE, pal = cols)
It would be cool to have this in sf! Would this fit better as a possible st_dissolve()
function or as a special case of st_union()
?
Thanks @loreabad6 that clarifies! My guess is that the only difference between the bottom two plots is a single queen neighbour, and I'm not sure that is what you want with "dissolve".
Forgive me if this is not the right thread to discuss this, but I have always wondered why {sf} didn't have an existing st_dissolve()
function that does something similar to the following:
object_dissolved <- object %>%
group_by(attribute) %>%
summarize(
geometry = sf::st_union(geometry), # or SHAPE in case of data loaded from GDB
across(everything(), first),
.groups = "drop"
)
Here it is easy since I have single aggregation for all fields (i.e. first
), however, it might be complicated to use different aggregate functions for different columns. For instance, I might want to sum()
an attribute called area
, take a mean()
of population
, and calculate median()
of median_household_income
. In that case, it might be much easier to group_by()
and st_union()
and specify aggregate function separately for different columns. While it would be cool to have a st_dissolve()
function, I am unsure how difficult would it be to implement multiple aggregation within the same function.
Could I ask why the aggregate
method for sf
objects is not adequate for these needs?
Could I ask why the
aggregate
method forsf
objects is not adequate for these needs?
I do not want to hijack the initial issue, but the aggregate function is much less intuitive as is, which is not consistent with other SF
functions. If I were to want to take the mean
of numeric columns, sum
the integer column, and take the first
values of the character column, things would get messy really fast. I ended up using group_by()
and summarize()
.
# Select only the numeric columns
numeric_cols <- sapply(wfrc_taz, is.numeric)
# Perform the aggregation only on numeric columns
wfrc_taz_agg <- aggregate(
wfrc_taz[, numeric_cols], # Subset to numeric columns
by = list(CO_NAME = wfrc_taz$CO_NAME), # Aggregate by CO_NAME
FUN = sum
)
However, the function provided by @rCarto seems to be the exact one I was looking for. Thank you so much. The only wish/question I have is if it would work with logical filters such as in the code below. Also, maybe the fun
argument could take in the actual function (again for consistency), rather than a character of the function name.
r <- st_aggregate(nc, "dummy", c(is.character(), is.integer(), is.numeric()), c(first, sum, mean))
In that case, we would expand the by
argument to take in values such as r colnames %in% c("BIR74", "NWBIR74")
and so on to apply different aggregation functions to different sets of columns.
A pipe-friendly st_dissolve function would be really helpful!
I took the sample provided by @JosiahParry and turned it into a more involved pair of helper functions that handles grouped sf data input (updated this reprex to better handle the .by grouping argument).
Safely use spdep::poly2nb and spdep::n.comp.nb to get an index of neighboring
features
poly_2_nb_id <- function(x,
snap = NULL,
queen = TRUE,
quiet = TRUE,
...) {
rlang::check_installed("spdep")
fn <- invisible
if (quiet) {
fn <- suppressWarnings
}
rlang::try_fetch(
fn({
nb <- spdep::poly2nb(x, snap = snap, queen = queen, ...)
comp_nb <- spdep::n.comp.nb(nb)
comp_nb[["comp.id"]]
}),
error = \(cnd) {
rep_len(0, length(x))
}
)
}
Dissolve geometry preserving existing or supplied grouping variables
st_dissolve_by <- function(x,
...,
.by = NULL,
.keep = "nest",
do_union = TRUE,
.data_key = "data",
.dissolve_key = "group.comp.id") {
stopifnot(
!rlang::has_name(x, .dissolve_key),
is.data.frame(x)
)
# Handle tidyselect style .by arguments
by <- rlang::enquo(.by)
if (!dplyr::is_grouped_df(x) && !rlang::quo_is_null(by)) {
x <- dplyr::group_by(x, dplyr::across(!!by))
.by <- NULL
}
x_group_vars <- NULL
if (dplyr::is_grouped_df(x)) {
x_group_vars <- dplyr::group_vars(x)
.by <- x_group_vars
x <- dplyr::ungroup(x)
}
sf_column_nm <- attr(x, "sf_column")
# Create dissolve key with poly_2_nb_id
x <- x |>
dplyr::mutate(
"{.dissolve_key}" := paste0(
dplyr::cur_group_id(), ".",
poly_2_nb_id(.data[[sf_column_nm]], ...)
),
.by = .by
)
# Use st_combine or st_union (if `do_union = TRUE`)
sf_summarise_fn <- sf::st_combine
if (do_union) {
sf_summarise_fn <- sf::st_union
}
x_dissolve <- x |>
dplyr::summarise(
# Keep unique values for grouping variables (if supplied)
dplyr::across(
tidyselect::any_of(x_group_vars),
unique
),
# Combine geometry with sf summary function
dplyr::across(
tidyselect::all_of(sf_column_nm),
sf_summarise_fn
),
.by = tidyselect::all_of(.dissolve_key)
)
if (.keep != "nest") {
return(x_dissolve)
}
x_init <- x |>
tidyr::nest(
.by = dplyr::all_of(.dissolve_key),
.key = .data_key
)
x_dissolve |>
dplyr::left_join(
x_init,
by = .dissolve_key
) |>
dplyr::relocate(
dplyr::all_of(sf_column_nm),
.after = tidyselect::everything()
)
}
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
nc_dissolve <- st_dissolve_by(nc)
#> Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
#> ℹ Please use `all_of()` or `any_of()` instead.
#> # Was:
#> data %>% select(.by)
#>
#> # Now:
#> data %>% select(all_of(.by))
#>
#> See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
plot(nc_dissolve[, "group.comp.id"])
nc[["group"]] <- sample(LETTERS[1:5], size = nrow(nc), replace = TRUE)
nc_group_dissolve <- st_dissolve_by(nc, .by = group)
plot(nc_group_dissolve[, "group.comp.id"])
Created on 2024-10-08 with reprex v2.1.1
There is no need to call spdep::n.comp.nb
after spdep::poly2nb
as that functi0on already calls it by default, from 1.3-6 released 4 weeks ago, and returns it as an attribute: attr(<nb>, "ncomp")
. See also https://cran.r-project.org/web/packages/spdep/vignettes/subgraphs.html for more on subgraphs.
Thanks @rsbivand! I've updated the code taking advantage of this fact. Should be a considerable speed up.
As this is being discussed as a feature request for sf
, can we think of an easy way of doing this without introducing a dependency on spdep
?
That's a good question and a good point. I am unsure if it would be as "simple."
As I understand it, we would need to write custom geos C code to replicate the contiguity algorithm—create an RTree, find candidates, check them, etc. To identify contiguous neighbors. Then we'd have to implement a graph traversal algorithm to identify the subgraphs—which...feels like a lot!
@JosiahParry No GEOS code in poly2nb
. For sf
objects - and note that sf
and s2
are fully supported, https://github.com/r-spatial/spdep/blob/dbd3074b8822ec669fea628061bce88fe77c5bf2/R/poly2nb.R#L144-L170 shows how the candidate contiguous neighbours are found by 1) in the s2
case intersecting the polygons using the s2
spatial index or 2) in the non-s2
case buffering the bounding boxes of the polygons by +/- snap and taking their intersections.
GEOS could be used from 2010 to now with rgeos
for the same purpose using its STRtree index, to construct a foundInBox
list of candidate neighbours. Prior to 2010, no index lookup was available, https://github.com/r-spatial/spdep/blob/dbd3074b8822ec669fea628061bce88fe77c5bf2/R/poly2nb.R#L116-L140 was used to find the bounding box intersections.
In all cases, each candidate pair is checked only once, so above or below the principal diagonal but not both, saving half the time.
Finally, the C looping is done in https://github.com/r-spatial/spdep/blob/dbd3074b8822ec669fea628061bce88fe77c5bf2/src/polypoly.c#L216-L351, which checks for 1 boundary point within snap for queen, 2 within snap for rook, stopping when found.
I did test against moving to predicates in sf
, but first finding candidates using spatial indexing, then brute force checking for boundary points within snap seemed more efficient.
To do this in s2
/sf
(two routes), the problem would be to find out whether spatial indexing is good enough to use st_distance
and choose neighbours as polygons closer than snap. We'd still need candidate neighbours to reduce the burden.
Then to get the graph components, we'd need to convert the sgbp
into a sparse matrix (Matrix
, which is recommended, and already suggested in sf
), then an igraph
graph to find the subgraphs (not suggested in sf
).
There is no reason to make any packages needed depends or imports, suggests is sufficient, as anyone needing to do this in general terms should always use the aggregate
method for sf
objects, with the subgraph case only a very minor issue.
If "dissolving" subgraphs is desired, that subset of users could readily use spdep::poly2nb
to generate the subgraph memberships then use aggregate
.
is good enough to use st_distance and choose neighbours as polygons closer than snap.
IIRC, st_is_within_distance()
is an optimised function for that, in any case fast in s2
.