r-spatialecology/landscapemetrics

Running sample_lsm in parallel on a large raster with many smaller landscapes?


Hello,

Is there currently a way within the landscapemetrics package to run the sample_lsm() function in parallel, using multiple processors to reduce total computation time? I am using R v. 4.0.2 and package v. 1.4.5 at the moment.

I have a single land-use/land-cover raster spanning North America, within which I would like to summarize a handful of metrics for >2,000 locations (identified by landscape centroids), each with a fixed circular buffer. I expected this process to take some time, particularly given the size of the landscapes (39-km radial distances), and preliminary tests on my PC ran for hours without completing.

So far, bracketing my sample_lsm() call with raster::beginCluster(n = ncores) and raster::endCluster() does not seem to speed up the process.

Does anyone have any suggestions for approaching this problem using the landscapemetrics package? I can certainly provide example code later on, especially if that is helpful.

Thank you for your time and assistance in advance!

Best,
Andrew

Hi Andrew,

I think @mhesselbarth can say more about this, but since landscapemetrics is written in a tidy way, I believe the future family of packages would be a good match.

Anyway, I have two points:

  • are you trying to parallelize locally on your PC or on an HPC cluster? And if on an HPC, which scheduler do you use?
  • it would indeed be helpful if you could provide a toy code example so we can see exactly what you're trying to do.

Best,
Sebastian

Hi Sebastian @bitbacchus ,

Thanks for reaching out today. I did not know that the future package could potentially improve the processing time, and I am curious to learn more; I am relatively new to parallel computing in R. It's also good to know about the tidy coding behind the package, so thank you for that clarification as well.

To answer your questions:

  1. I have tried both approaches (local PC and on-campus HPC), and I have been relying more on the HPC as of late. I believe that the scheduler in use is SLURM.

  2. Here is a snippet of code from one of my recent HPC submissions (I use a separate .sbatch script to submit this job via SLURM):

##################################
##          R Settings          ##
##################################
#Packages and library calls
library(landscapemetrics)
library(rgeos)
library(rgdal)
library(raster) 
library(snow)
library(sp)
set.seed(999)

#Working directory on HPC
homedir <- "/mnt/home/user/Documents/LSM-2020/"
setwd(homedir)

#Global option: prohibit printing output values in terms of scientific notation
options(scipen=999)

#Get today's date for use with workspace save later
datestr <- format(Sys.time(), "%Y-%m-%d")

##################################
##      Preliminary Analysis      ##
##################################
nt <- 7 #Number of cores requested

#Load geographic data
rasterPath <- "/mnt/home/user/Documents/LSM-2020/CEC 2015/NA_NALCMS_2015_LC_30m_LAEA_mmu5pix_.tif" #Land-use  land-cover dataset
r <- raster(rasterPath) #Raster spans North America
pts <- readRDS("/mnt/home/user/Documents/LSM-2020/pts.rds") #>2,000 landscape centroids in US and CAN

#For choosing metrics, see https://r-spatialecology.github.io/landscapemetrics/reference/list_lsm.html
metricsAll <- list_lsm(level = "landscape") #View available metrics for preliminary selection
chosenMetrics <- c("area_mn", "contag", "ed", "enn_mn", "np", "pd", "pr", "shdi") #The 8 metrics of interest
listMet <- metricsAll[metricsAll$metric %in% chosenMetrics, ] #Subset the metric list
as.data.frame(listMet[, "metric"]) #Print the selection as a sanity check
beginCluster(n = nt) #Assuming the raster::beginCluster()/endCluster() pair can help speed up the next call
all.class <- sample_lsm(landscape = r, y = pts, plot_id = pts$Centroids, shape = "circle", size = 39428.93, metric = listMet$metric, level = "landscape") #Compute the 8 metrics for each of the >2,000 landscapes corresponding to survey locations
endCluster() #Close use of the cluster
warnings()
all.class <- as.data.frame(all.class) #Convert result to classic data frame

################################################################################
#Save workspace data
filename <- paste('HPCC_results_', datestr, '.RData', sep = '')
filepath <- paste(homedir, filename, sep = '')
save(list = ls(all.names = TRUE), file = filepath, envir = .GlobalEnv)
################################################################################

Please let me know if there is anything that I can clarify above.

Best,
Andrew

Hey Andrew,

There is no built-in parallelization in landscapemetrics. However, it should be quite easy to program that yourself. I would basically create a grid that covers your whole landscape (see e.g. the first part of this blog post). Once you have set up the grid, you can loop through it in parallel, extract the "local landscape" within the current grid cell (using raster::crop() and raster::mask()), and calculate the metrics using e.g. the furrr package.

Cheers,
Max

Hi Max @mhesselbarth ,

Thanks for weighing in as well. I sincerely appreciate the informative blog post you've shared; however, I should reiterate that I am still very new to parallel computing in R, so (based on your suggestions) what I am missing is how to (1) loop over the larger continental grid, (2) extract the local landscapes one at a time in that loop, (3) compute the desired metrics, and (4) save the estimated metrics for each local landscape into one data frame, all in parallel, especially using the furrr package.

Would it be feasible to provide a small code example demonstrating what you've suggested, especially the looping and the furrr functionality? Recall that I have two spatial objects: a continental raster and a SpatialPointsDataFrame of local landscape centroids.

Thank you for your continued support. I sincerely appreciate your time.

In the meantime, I will also be working toward a similar solution in my own R programs, and if I identify one, I will post it to this thread for the benefit of others.

Best,
Andrew

This should do the trick:

library(landscapemetrics)     
library(raster)               
library(sf) 
library(sp)

library(future)
library(furrr)

# load data 
data("augusta_nlcd")
my_raster <- augusta_nlcd

# create grid as example. This could be any sp or sf object to loop through
my_grid <- sf::st_make_grid(st_as_sfc(sf::st_bbox(my_raster)), n = c(10, 10))
my_grid <- sf::st_sf(geom = my_grid)

# set to multisession; see ?plan for details
future::plan(multisession)

# loop through each of the large grid cells and calculate metrics within them
result <- furrr::future_map(1:nrow(my_grid), function(i) {

  # crop the raster to the current grid cell
  my_raster_temp <- raster::crop(x = my_raster,
                                 y = sf::as_Spatial(my_grid[i, ]))

  # mask the raster to the current grid cell
  my_raster_temp <- raster::mask(x = my_raster_temp,
                                 mask = sf::as_Spatial(my_grid[i, ]))

  # calculate desired metrics
  landscapemetrics::calculate_lsm(landscape = my_raster_temp,
                                  what = c("lsm_p_area", "lsm_c_ca", "lsm_l_ta"))
})
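
future_map() returns a list with one result tibble per grid cell. If you want everything in a single data frame afterwards, something along these lines should work (untested sketch; it uses dplyr, which is not loaded above, and the .id column simply records the grid-cell index):

# stack the per-cell tibbles into one data frame; .id stores the grid-cell index
result_df <- dplyr::bind_rows(result, .id = "grid_cell")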

Cheers,
Max

Hi Max @mhesselbarth ,

Thank you very much for providing the example code above. Your demo has helped me understand the problem quite a bit more. Below, I have adapted your code to more closely match my purposes (using buffered spatial points rather than a grid), and I look forward to testing it on our HPC soon.

# for metrics and data manipulation
library(landscapemetrics)     
library(raster)               
library(sf) 
library(sp)
library(dplyr)

# for parallel processing
library(future)
library(furrr)

# load the data
r <- raster("/mnt/home/user/Documents/LSM-2020/CEC 2015/NA_NALCMS_2015_LC_30m_LAEA_mmu5pix_.tif")
pts <- readRDS("/mnt/home/user/Documents/LSM-2020/pts.rds")

# create polygon buffers around local-landscape centroids and use them in place of my_grid (see above)
my_pts <- sf::st_buffer(sf::st_as_sf(pts), 39428.93) #st_as_sf() converts the SpatialPointsDataFrame so nrow() works below

# set to multisession; see ?plan for details
future::plan(multisession)

# loop through each of the polygon buffers around local-landscape centroids and calculate metrics for them
my_metric <- furrr::future_map(1:nrow(my_pts), function(i) {

  # crop sample plot
  my_raster_temp <- raster::crop(x = r,
                                 y = sf::as_Spatial(my_pts[i, ]))

  # mask sample plot
  my_raster_temp <- raster::mask(x = my_raster_temp,
                                 mask = sf::as_Spatial(my_pts[i, ]))

  # calculate desired metrics
  landscapemetrics::calculate_lsm(landscape = my_raster_temp, level = "landscape",
                                  metric = "ent")
})
warnings()
my_metric

# merging results; see https://nowosad.github.io/post/lsm-bp2/
# each list element is a one-row tibble, so stack them before column-binding to the buffers
my_result <- bind_cols(my_pts, bind_rows(my_metric))
plot(my_result["value"])

Hoping for the best,
Andrew

Looks good. You should actually be able to use future_map_dfr(), I think, which lets you skip the bind_rows() step (quick sketch below). If you want to use an HPC, the clustermq package might also be of interest to you. If your HPC uses SLURM, this homepage might help you as well.
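
For example, here is a minimal, untested sketch of the future_map_dfr() idea, reusing r and my_pts from your script above:

# future_map_dfr() row-binds the per-buffer tibbles as it goes,
# returning a single data frame instead of a list
my_metric_df <- furrr::future_map_dfr(1:nrow(my_pts), function(i) {

  # crop and mask the raster to the current buffer
  my_raster_temp <- raster::crop(x = r, y = sf::as_Spatial(my_pts[i, ]))
  my_raster_temp <- raster::mask(x = my_raster_temp, mask = sf::as_Spatial(my_pts[i, ]))

  # landscape-level entropy for this buffer
  landscapemetrics::calculate_lsm(landscape = my_raster_temp,
                                  level = "landscape", metric = "ent")
})

# attach the metrics to the buffer geometries as before
my_result <- dplyr::bind_cols(my_pts, my_metric_df)

For SLURM specifically, the future.batchtools package should also let you keep the same furrr code and only swap the plan() call (it provides a batchtools_slurm backend, if I remember correctly).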

Will close this for now.