morrowcj/remotePARTS

fitAR_map needs parallel version

Opened this issue · 4 comments

Problem

With large datasets, fitAR_map() (and perhaps fitCLS_map()) is exceptionally slow - it can take longer than the partitioned GLS, since it only runs in serial (1 thread). This is clearly an oversight.

Solution

This functionality lends itself to parallelization well. AR models are fit to each pixel independently. So, it makes sense to implement a parallel version of fitAR_map() via an argument ncores.

Early on in the development of remotePARTS, I gave examples about how to apply fitAR() to a full map with the raster package. But, since this package doesn't officially support rasters, an implementation that can handle matrices (and dataframes) is important.

I've been successful with a script containing the following code that seems to work well:

iterator-function.R:

iblkrow <- function(a, chunks) {
  n <- nrow(a)
  i <- 1
  
  nextElem <- function() {
    if (chunks <= 0 || n <= 0) stop('StopIteration')
    m <- ceiling(n / chunks)
    r <- seq(i, length=m)
    i <<- i + m
    n <<- n - m
    chunks <<- chunks - 1
    a[r,, drop=FALSE]
  }  
  structure(list(nextElem=nextElem), class=c('iblkrow', 'iter'))
}
nextElem.iblkrow <- function(obj) obj$nextElem()

parallel-AR.R:

# ... setup
  library(doParallel)
  library(foreach)
  registerDoParallel(core_num)
  source("iterator-function.R")  # loads helper functions including the iterator iblkrow()
  
  # loop through blocks of the data (using helper function)
  AR_results = foreach(y = iblkrow(data_matrix, core_num),
                       crds = iblkrow(coord_matrix, core_num),
                       .packages = c("remotePARTS", "foreach"), .inorder = TRUE, .combine = rbind) %dopar% {
                         # loop through each row of the chunks
                         out = foreach(i=seq_len(nrow(y)), .combine = rbind, .inorder = TRUE) %do% {
                           t = seq_len(ncol(y))  # time
                           AR = fitAR(as.formula(as.vector(y[i, ]) ~ t))  # AR fit
                           # return a collection of AR variables of interest
                           c(coef = unname(AR$coefficients["t"]),
                             pval = unname(AR$pval["t"]),
                             resids = AR$residuals)
                         }
                         rownames(out) = NULL # remove rownames "t"
                         out # return the combined results
                       }

@morrowcj , thank you very much for ,making the parallel version - this is very helpful!
Is there something missing in the second row of the loop?
crds = iblkrow(coord_matrix, ...?

You are correct, @PlekhanovaElena. I somehow lost an argument and the closing parenthesis. That line should read crds = iblkrow(coord_matrix, core_num),. I've edited my comment above to reflect this.

And just in case it wasn't clear, core_num is the total number of cores you'd like to use for the operation, data_matrix is a matrix containing the spatial time series data (rows are locations, columns are time steps) and coord_matrix is the coordinate matrix (rows are locations and columns are longitude and latitude, respectively.

The above code (or something similar) should eventually be implemented in the package. Hopefully I'll have time to get to this soon.