matloff/partools

More helper functions

Opened this issue · 3 comments

I think the code in partools could be made substantially more elegant with a few helper functions:

clusterRequireNamespace <- function(cl, pkg) {
  clusterCall(cl, requireNamespace, package = pkg, quietly = TRUE)
}
clusterRequirePackage <- function(cl, pkg) {
  clusterCall(cl, library, package = pkg, character.only = TRUE)
}
clusterEval < function(cl, expr) {
  clusterCall(cl, eval, expr, env = .GlobalEnv)
}

clusterAssign <- function(cl, name, value) {
  clusterCall(cl, assign, x = name, value = value)
}

asCall <- function(x) {
  if (inherits(x, "formula")) {
    stopifnot(length(x) == 2)
    x[[2]]
  } else if (is.atomic(x) || is.name(x) || is.call(x)) {
    x
  } else {
    stop("Unknown input")
  }
}

clusterEvalI <- function(cl, expr, ...) {
  args <- lapply(list(...), asCall)
  interpolated <- methods::substituteDirect(asCall(expr), args)

  clusterEval(cl, interpolated)
}

This allows (e.g.) fileread to be rewritten as:

fileread <- function (cls, fname, dname, ndigs, header = FALSE, sep = " ", 
                      usefread = FALSE) {
  if (usefread) {
    clusterRequireNamespace(cls, "data.table")
    clusterEvalI(cls, ~ myfread <- data.table::fread))
  } else {
    clusterEvalI(cls, ~ myfread <- read.table))
  }

  clusterEvalI(cl,
    ~ `_mychunk` <- filechunkname(fname, ndigs), 
    fname = fname, ndig
    s = ndgs
  )

  clusterEvalI(cls,
    ~ dname <- myfread(`_mychunk`, header = header, sep = sep),
    dname = as.name(dname),
    header = header,
    sep = sep
  )
}

which I think is rather more clear since it focusses on the computation, rather than the details of the call manipulation (and working with the AST is obviously more elegant than working with strings)

(Also note the use of _mychunk as a variable name - using a non syntactic variable should decrease the chances of accidentally clobbering an object that the user created)

I haven't tested this code yet, but I'm happy to do so if you're interested in the idea.

Thanks, I'll look into this. Actually, there are lots of ways in which the code could be made more elegant. :-)

Making changes now to make these sorts of functions more clear. For example:

Before:

filesave <- function(cls,dname,newbasename,ndigs,sep) {

# what will the new file be called at each node?
tmp <- paste('"',newbasename,'",',ndigs,sep='')
cmd <- paste('myfilename <- filechunkname(',tmp,')',sep='')
clusterExport(cls,"cmd",envir=environment())
clusterEvalQ(cls,eval(parse(text=cmd)))
# start building the write.table() call
tmp <- paste(dname,'myfilename',sep=',')
# what will the column names be for the new files?
clusterEvalQ(cls,eval(parse(text=cmd)))
cncmd <- paste('colnames(',dname,')',sep='')
clusterExport(cls,"cncmd",envir=environment())
clusterEvalQ(cls,cnames <- eval(parse(text=cncmd)))[[1]]
# now finish pasting the write.table() command, and run it
writecmd <- paste('write.table(',tmp,
   ',row.names=FALSE,col.names=cnames,sep="',sep,'")',sep='')
clusterExport(cls,"writecmd",envir=environment())
clusterEvalQ(cls,eval(parse(text=writecmd)))

After:

filesave <- function(cls,dname,newbasename,ndigs,sep, ...) {
    write_one_chunk <- function(dname, newbasename, ndigs, sep, ...)
    {   
        write.table(get(dname)
                    , file = filechunkname(newbasename, ndigs)
                    , sep = sep 
                    , ...)
    }   

    clusterCall(cls, write_one_chunk, dname, newbasename, ndigs, sep, ...)
}

This was motivated by needing to pass more arguments into the underlying functions like write.table(). If there are no objections I'd like to continue refactoring the code in this way.