[Feature] Dissolve Boundaries

Question

[Feature] Dissolve Boundaries

JosiahParry opened this issue 4 months ago · 13 comments

A common GIS task is to dissolve boundaries based on shared boundaries of polygons. Doing this with R always bends my head a little bit.

I think having a utility function in {sf} to do this would be very handy.

Here is a minimal example of how that function might work. Using spdep is probably the best for this task because it already has functions for identifying contiguous polygons as well as the connected subgraphs.

This is inspired by @CGMossa's PhD work

dissolve_boundaries <- function(x) {
  if (!requireNamespace("spdep")) {
    stop("spdep is needed to use this function")
  }

  crs <- sf::st_crs(x)
  nbs <- spdep::poly2nb(x)
  nb_ids <- attr(nbs, "ncomp")[["comp.id"]]
  res <- lapply(
    split(x, nb_ids),
    sf::st_union,
    is_coverage = TRUE
  )

  sf::st_sfc(
    unlist(res, recursive = FALSE),
    crs = crs
  ) |> sf::st_combine()

}

Answer 1 · 2024-08-12T13:16:43.000Z

It took me a few minutes to understand what exactly this function would do in comparison to st_cast(st_union(x), "POLYGON") so I made a little reprex to show it.

library(sf)
nc = read_sf(system.file("shape/nc.shp", package="sf"))

set.seed(241)
test = nc[sample(nrow(nc), 18),] |> st_geometry()

a = test |> st_union() |> st_as_sf()
a$id = 1:nrow(a)
b = a |> st_cast("POLYGON") |> st_as_sf()
#> Warning in st_cast.sf(a, "POLYGON"): repeating attributes for all
#> sub-geometries for which they may not be constant
b$id = 1:nrow(b)
c = test |> dissolve_boundaries() |> st_as_sf() 
#> Loading required namespace: spdep
c$id = 1:nrow(c)

b
#> Simple feature collection with 11 features and 1 field
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -83.73952 ymin: 34.30505 xmax: -75.77316 ymax: 36.55716
#> Geodetic CRS:  NAD27
#> First 10 features:
#>     id                              x
#> 1    1 POLYGON ((-83.1615 35.05922...
#> 1.1  2 POLYGON ((-76.00897 36.3196...
#> 1.2  3 POLYGON ((-75.97629 36.5179...
#> 1.3  4 POLYGON ((-75.78317 36.2251...
#> 1.4  5 POLYGON ((-82.74389 35.4180...
#> 1.5  6 POLYGON ((-78.95108 36.2338...
#> 1.6  7 POLYGON ((-76.70538 35.4119...
#> 1.7  8 POLYGON ((-76.6949 35.35043...
#> 1.8  9 POLYGON ((-81.659 36.11759,...
#> 1.9 10 POLYGON ((-80.45065 35.7648...
nrow(b)
#> [1] 11

c
#> Simple feature collection with 8 features and 1 field
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: -83.73952 ymin: 34.30505 xmax: -75.77316 ymax: 36.55716
#> Geodetic CRS:  NAD27
#>                                x id
#> 1 MULTIPOLYGON (((-75.97629 3...  1
#> 2 POLYGON ((-77.98668 34.3399...  2
#> 3 POLYGON ((-78.95108 36.2338...  3
#> 4 MULTIPOLYGON (((-76.70538 3...  4
#> 5 POLYGON ((-82.24016 35.4681...  5
#> 6 POLYGON ((-81.659 36.11759,...  6
#> 7 POLYGON ((-80.50825 36.0708...  7
#> 8 POLYGON ((-83.1615 35.05922...  8
nrow(c)
#> [1] 8

cols = paletteer::paletteer_d("tvthemes::gravityFalls")
par(mfrow = c(2,2), mar = c(0,0,1,0))
plot(test, main = "Original geoms")
plot(a, main = "st_union()",
     key.pos = NULL, reset = FALSE, pal = cols)
plot(b, main = "st_union |> st_cast('POLYGON')",
     key.pos = NULL, reset = FALSE, pal = cols)
plot(c, main = "dissolve_boundaries()",
     key.pos = NULL, reset = FALSE, pal = cols)

It would be cool to have this in sf! Would this fit better as a possible st_dissolve() function or as a special case of st_union()?

Answer 2 · 2024-08-28T18:46:30.000Z

Thanks @loreabad6 that clarifies! My guess is that the only difference between the bottom two plots is a single queen neighbour, and I'm not sure that is what you want with "dissolve".

Answer 3 · 2024-09-07T00:35:35.000Z

Forgive me if this is not the right thread to discuss this, but I have always wondered why {sf} didn't have an existing st_dissolve() function that does something similar to the following:

object_dissolved <- object %>%
  group_by(attribute) %>%
  summarize(
    geometry = sf::st_union(geometry), # or SHAPE in case of data loaded from GDB
    across(everything(), first),
    .groups = "drop"
  )

Here it is easy since I have single aggregation for all fields (i.e. first), however, it might be complicated to use different aggregate functions for different columns. For instance, I might want to sum() an attribute called area, take a mean() of population, and calculate median() of median_household_income. In that case, it might be much easier to group_by() and st_union() and specify aggregate function separately for different columns. While it would be cool to have a st_dissolve() function, I am unsure how difficult would it be to implement multiple aggregation within the same function.

Answer 4 · 2024-09-09T10:14:14.000Z

@ar-puuk , I may have a small tidyverse-free function for what you describe here. Not bullet proof, though...

Answer 5 · 2024-09-09T10:25:14.000Z

Could I ask why the aggregate method for sf objects is not adequate for these needs?

Answer 6 · 2024-09-10T01:14:04.000Z

Could I ask why the aggregate method for sf objects is not adequate for these needs?

I do not want to hijack the initial issue, but the aggregate function is much less intuitive as is, which is not consistent with other SF functions. If I were to want to take the mean of numeric columns, sum the integer column, and take the first values of the character column, things would get messy really fast. I ended up using group_by() and summarize().

# Select only the numeric columns
numeric_cols <- sapply(wfrc_taz, is.numeric)

# Perform the aggregation only on numeric columns
wfrc_taz_agg <- aggregate(
  wfrc_taz[, numeric_cols],                   # Subset to numeric columns
  by = list(CO_NAME = wfrc_taz$CO_NAME),      # Aggregate by CO_NAME
  FUN = sum
)

However, the function provided by @rCarto seems to be the exact one I was looking for. Thank you so much. The only wish/question I have is if it would work with logical filters such as in the code below. Also, maybe the fun argument could take in the actual function (again for consistency), rather than a character of the function name.

r <- st_aggregate(nc, "dummy", c(is.character(), is.integer(), is.numeric()), c(first, sum, mean))

In that case, we would expand the by argument to take in values such as r colnames %in% c("BIR74", "NWBIR74") and so on to apply different aggregation functions to different sets of columns.

Answer 7 · 2024-10-08T03:35:25.000Z

A pipe-friendly st_dissolve function would be really helpful!

I took the sample provided by @JosiahParry and turned it into a more involved pair of helper functions that handles grouped sf data input (updated this reprex to better handle the .by grouping argument).

Safely use spdep::poly2nb and spdep::n.comp.nb to get an index of neighboring
features

poly_2_nb_id <- function(x,
                         snap = NULL,
                         queen = TRUE,
                         quiet = TRUE,
                         ...) {
  rlang::check_installed("spdep")
  fn <- invisible
  if (quiet) {
    fn <- suppressWarnings
  }

  rlang::try_fetch(
    fn({
      nb <- spdep::poly2nb(x, snap = snap, queen = queen, ...)
      comp_nb <- spdep::n.comp.nb(nb)
      comp_nb[["comp.id"]]
    }),
    error = \(cnd) {
      rep_len(0, length(x))
    }
  )
}

Dissolve geometry preserving existing or supplied grouping variables

st_dissolve_by <- function(x,
                           ...,
                           .by = NULL,
                           .keep = "nest",
                           do_union = TRUE,
                           .data_key = "data",
                           .dissolve_key = "group.comp.id") {
  stopifnot(
    !rlang::has_name(x, .dissolve_key),
    is.data.frame(x)
  )

  # Handle tidyselect style .by arguments
  by <- rlang::enquo(.by)
  if (!dplyr::is_grouped_df(x) && !rlang::quo_is_null(by)) {
    x <- dplyr::group_by(x, dplyr::across(!!by))
    .by <- NULL
  }

  x_group_vars <- NULL

  if (dplyr::is_grouped_df(x)) {
    x_group_vars <- dplyr::group_vars(x)
    .by <- x_group_vars
    x <- dplyr::ungroup(x)
  }

  sf_column_nm <- attr(x, "sf_column")

  # Create dissolve key with poly_2_nb_id
  x <- x |>
    dplyr::mutate(
      "{.dissolve_key}" := paste0(
        dplyr::cur_group_id(), ".",
        poly_2_nb_id(.data[[sf_column_nm]], ...)
      ),
      .by = .by
    )

  # Use st_combine or st_union (if `do_union = TRUE`)
  sf_summarise_fn <- sf::st_combine
  if (do_union) {
    sf_summarise_fn <- sf::st_union
  }

  x_dissolve <- x |>
    dplyr::summarise(
      # Keep unique values for grouping variables (if supplied)
      dplyr::across(
        tidyselect::any_of(x_group_vars),
        unique
      ),
      # Combine geometry with sf summary function
      dplyr::across(
        tidyselect::all_of(sf_column_nm),
        sf_summarise_fn
      ),
      .by = tidyselect::all_of(.dissolve_key)
    )

  if (.keep != "nest") {
    return(x_dissolve)
  }

  x_init <- x |>
    tidyr::nest(
      .by = dplyr::all_of(.dissolve_key),
      .key = .data_key
    )

  x_dissolve |>
    dplyr::left_join(
      x_init,
      by = .dissolve_key
    ) |>
    dplyr::relocate(
      dplyr::all_of(sf_column_nm),
      .after = tidyselect::everything()
    )
}

library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

nc <- read_sf(system.file("shape/nc.shp", package = "sf"))

nc_dissolve <- st_dissolve_by(nc)
#> Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
#> ℹ Please use `all_of()` or `any_of()` instead.
#>   # Was:
#>   data %>% select(.by)
#> 
#>   # Now:
#>   data %>% select(all_of(.by))
#> 
#> See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

plot(nc_dissolve[, "group.comp.id"])

nc[["group"]] <- sample(LETTERS[1:5], size = nrow(nc), replace = TRUE)

nc_group_dissolve <- st_dissolve_by(nc, .by = group)

plot(nc_group_dissolve[, "group.comp.id"])

^{Created on 2024-10-08 with reprex v2.1.1}

Answer 8 · 2024-10-08T18:19:09.000Z

There is no need to call spdep::n.comp.nb after spdep::poly2nb as that functi0on already calls it by default, from 1.3-6 released 4 weeks ago, and returns it as an attribute: attr(<nb>, "ncomp"). See also https://cran.r-project.org/web/packages/spdep/vignettes/subgraphs.html for more on subgraphs.

Answer 9 · 2024-10-08T18:55:28.000Z

Thanks @rsbivand! I've updated the code taking advantage of this fact. Should be a considerable speed up.

Answer 10 · 2024-10-08T19:14:23.000Z

As this is being discussed as a feature request for sf, can we think of an easy way of doing this without introducing a dependency on spdep?

Answer 11 · 2024-10-08T19:31:52.000Z

That's a good question and a good point. I am unsure if it would be as "simple."

As I understand it, we would need to write custom geos C code to replicate the contiguity algorithm—create an RTree, find candidates, check them, etc. To identify contiguous neighbors. Then we'd have to implement a graph traversal algorithm to identify the subgraphs—which...feels like a lot!

Answer 12 · 2024-10-09T08:19:52.000Z

@JosiahParry No GEOS code in poly2nb. For sf objects - and note that sf and s2 are fully supported, https://github.com/r-spatial/spdep/blob/dbd3074b8822ec669fea628061bce88fe77c5bf2/R/poly2nb.R#L144-L170 shows how the candidate contiguous neighbours are found by 1) in the s2 case intersecting the polygons using the s2 spatial index or 2) in the non-s2 case buffering the bounding boxes of the polygons by +/- snap and taking their intersections.

GEOS could be used from 2010 to now with rgeos for the same purpose using its STRtree index, to construct a foundInBox list of candidate neighbours. Prior to 2010, no index lookup was available, https://github.com/r-spatial/spdep/blob/dbd3074b8822ec669fea628061bce88fe77c5bf2/R/poly2nb.R#L116-L140 was used to find the bounding box intersections.

In all cases, each candidate pair is checked only once, so above or below the principal diagonal but not both, saving half the time.

Finally, the C looping is done in https://github.com/r-spatial/spdep/blob/dbd3074b8822ec669fea628061bce88fe77c5bf2/src/polypoly.c#L216-L351, which checks for 1 boundary point within snap for queen, 2 within snap for rook, stopping when found.

I did test against moving to predicates in sf, but first finding candidates using spatial indexing, then brute force checking for boundary points within snap seemed more efficient.

To do this in s2/sf (two routes), the problem would be to find out whether spatial indexing is good enough to use st_distance and choose neighbours as polygons closer than snap. We'd still need candidate neighbours to reduce the burden.

Then to get the graph components, we'd need to convert the sgbp into a sparse matrix (Matrix, which is recommended, and already suggested in sf), then an igraph graph to find the subgraphs (not suggested in sf).

There is no reason to make any packages needed depends or imports, suggests is sufficient, as anyone needing to do this in general terms should always use the aggregate method for sf objects, with the subgraph case only a very minor issue.

If "dissolving" subgraphs is desired, that subset of users could readily use spdep::poly2nb to generate the subgraph memberships then use aggregate .

Answer 13 · 2024-10-09T10:09:40.000Z

is good enough to use st_distance and choose neighbours as polygons closer than snap.

IIRC, st_is_within_distance() is an optimised function for that, in any case fast in s2.