DrylandEcology/rSFSW2

Test project tests fail on Linux machines

nip5 opened this issue · 1 comments

nip5 commented

Test project tests fail on Linux machines due to "NA" being coerced to NA in Linux and thus site 4 is skipped on database generation.

@CaitlinA @nip5

Did PR #360 close this issue?

I don't think that this issue description is helpful, e.g., there is no reproducible example; there is no example output of the error message, it doesn't state which database (out of >= 3), etc.

Also claiming that this is an issue for (all) Linux machines is not helpful; travis-ci which we run with Ubuntu 14.04.5 LTS did not produce this issue (https://travis-ci.org/DrylandEcology/rSFSW2/builds/502228791 passing the previous PR for rSFW2 v3.1.3): i.e.,

[1] "rSFSW2's 'make_dbOutput': started at 2019-03-05 20:38:51"
[1] "rSFSW2's 'make_dbOutput': ended after 0.75 s"

Maybe it fails on your specific Linux machine, but then writing what setup you use would be more helpful for debugging. As it is, I see no specific reason why this was an issue with the R code of rSFSW2 and not with your specific setup.

Commit bdc6062 introduces new code to replace any NA (which carry a specific meaning in R) in weather folder names with a text string (of no specific meaning). I find this confusing without further explanation.

We used to remove NAs in weather folder names with na.exclude() (

temp <- unique(stats::na.exclude(
). Contrary to that old code snippet, it appears that when I changed our test project about 10 months ago, I changed how we re-write the InputMaster csv, i.e., we started to write "NA" instead of NA to the csv file (db3948e). Thus, the reference output database contains an entry for site 4 "NA". I guess that maybe my local computer and travis-ci, translate this into "NA" while yours reads it as NA from disk? The "NA" has ended up in the reference output database because it was no longer removed by na.exclude. Thus, if this is the problem, then I guess that instead of laboriously rewriting the NAs into "NA"s, simply removing the na.exclude (maybe with an additional as.character(SFSW2_prj_inputs[["SWRunInformation"]]$WeatherFolder) if needed) would produce the same result as the reference output database, but much easier and faster?

Since, originally these NAs indicated non-included sites, we may want to consider whether we prefer eventually to revert to the old behavior, i.e., not include those sites in the weatherfolders table?

Either way, looping over vectors tends to be slow and in our use cases site_data could have many rows; however, the code is executed only once and thus it does not matter much for the overall performance of a simulation run. Still, you may want to read up on vectorized code (e.g., https://datascienceplus.com/strategies-to-speedup-r-code/), i.e., the code in the commit

  for (i in seq(sites_data$WeatherFolder)){
    if (is.na(sites_data$WeatherFolder[i])){
      sites_data$WeatherFolder[i] <- "NA"
    }
  }

is equivalent to

sites_data$WeatherFolder[is.na(sites_data$WeatherFolder)] <- "NA"

Here, an example with 100,000 sites -- a reasonable setup for one of our simulation experiments: the for-loop is (on my machine) over 1,300 times slower...

sites_data <- data.frame(WeatherFolder = as.character(sample(1e5)),
  stringsAsFactors = FALSE)
sites_data$WeatherFolder[c(10, 50000)] <- NA

f1 <- function(x) {
  for (i in seq(x$WeatherFolder)){
    if (is.na(x$WeatherFolder[i])){
      x$WeatherFolder[i] <- "NA"
    }
  }
  x
}

f2 <- function(x) {
  x$WeatherFolder[is.na(x$WeatherFolder)] <- "NA"
  x
}

identical(f1(sites_data), f2(sites_data))

microbenchmark::microbenchmark(f1(sites_data), f2(sites_data))

But again, simply removing na.exclude from the call in line 1843 should have done the trick as well, i.e.,

      temp <- unique(SFSW2_prj_inputs[["SWRunInformation"]]$WeatherFolder)