mikemahoney218/mm218.dev

posts/2023-06-06-spatialsample_splits/


Mike Mahoney - From the inbox: How can I get fold assignments from spatialsample?

Straightforward methods for answering a straightforward question.

https://www.mm218.dev/posts/2023-06-06-spatialsample_splits/

Hi Mike,
This is super useful. I have a question related to the package, though not to this post specifically.

Is there a way to cluster a spatial data frame based on more than one variable? Let's say that I have a spatial data frame with 3 variables in total: crime, income, geometry.
When I want to generate clusters using only "crime", I delete the "income" variable and run "spatial_clustering_cv". However, I would like to know if there is a way for "spatial_clustering_cv" to identify clusters based on both socio-economic variables, "crime" and "income".

Many thanks in advance!

Hi @adrianuzkcc !

So this changed in January this year, as part of spatialsample 0.3.0. Functions in spatialsample now only accept sf objects, and only assign observations to folds based on the geometry column in those sf objects. This is motivated in part by the fact that clustering on the other columns too would assume that "crime" or "income" are in the same units as your spatial data, and that a unit of distance along any of these axes is equally important -- that a 1m change in spatial distance is the same as a 1 dollar change in income or a 1 unit change in crime.
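To make the geometry-only behavior concrete, here's a minimal sketch. The toy data, the column values, and the CRS (EPSG:32618, picked arbitrarily as a projected CRS in meters) are all made up for illustration. As described above, the "crime" and "income" columns should just ride along without influencing which fold a point lands in.

```r
library(sf)
library(spatialsample)

set.seed(123)

# Hypothetical toy data: random point locations plus two attribute columns
toy <- data.frame(
  x = runif(100, min = 0, max = 1000),
  y = runif(100, min = 0, max = 1000),
  crime = rnorm(100, mean = 50, sd = 10),
  income = rnorm(100, mean = 40000, sd = 8000)
)

# Convert to an sf object; the CRS here is an assumption for the example
toy_sf <- st_as_sf(toy, coords = c("x", "y"), crs = 32618)

# Folds are built by clustering on the geometry column
spatial_folds <- spatial_clustering_cv(toy_sf, v = 5)
spatial_folds
```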

There's some interesting work being done on blocking based upon both spatial locations and predictor variables, including this paper from last week/next month. None of that has made its way into spatialsample yet -- I'd like to see more people talking about/using these types of methods before I commit to maintaining code for them long-term! -- but I would love for them to eventually get added to the package.

If you do have a situation where you've got a lot of variables that share units and are equally important (or any data with a meaningful non-spatial "distance" metric), check out the new-ish rsample::clustering_cv(). This is a really flexible function which lets you specify your variables, as well as your own distance and clustering functions, in order to perform clustering on any set of variables that makes sense for your problem.
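As a rough sketch of what that could look like for the "crime" and "income" example: the data below is invented, and standardizing both columns first is my own assumption about how you might put them on a comparable footing, not something rsample requires or does for you. I'm also using a plain data frame here; if you're starting from an sf object, you'd probably want to drop the geometry first (for example with sf::st_drop_geometry()).

```r
library(rsample)

set.seed(123)

# Hypothetical toy data standing in for the "crime" and "income" columns
neighborhoods <- data.frame(
  crime = rnorm(100, mean = 50, sd = 10),
  income = rnorm(100, mean = 40000, sd = 8000)
)

# Assumption: standardize both variables so neither one dominates
# the distance calculation purely because of its units
neighborhoods$crime_scaled <- as.numeric(scale(neighborhoods$crime))
neighborhoods$income_scaled <- as.numeric(scale(neighborhoods$income))

# Cluster on both variables; each cluster becomes one held-out fold
attribute_folds <- clustering_cv(
  neighborhoods,
  vars = c(crime_scaled, income_scaled),
  v = 5
)
attribute_folds
```

The defaults use stats::dist() for distances and k-means for clustering; the distance_function and cluster_function arguments are where you'd swap in your own.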

Hope that answers your question.