bwlewis/doRedis

Task gets resubmitted while the worker is still working on it; last task was not processed at all

Opened this issue · 8 comments

It happened to me that the first task was re-submitted while the worker was still working on it.

Looking at the console of a single worker (and also of the master), the last task (of 4 in total) was not processed at all, while the first task was processed twice. foreach returned a list in which the last task's result was missing (a list of length only 3). After the foreach job was done, the worker kept writing this to the console:

Processing job 1 from queue jobs
Processing job 1 from queue jobs
Processing job 1 from queue jobs
...

On the master, the code is approximately as sketched below. I always clean up the queue using the trick described at http://stackoverflow.com/q/25947991/684229:

    removeQueue('jobs')
    registerDoRedis('jobs', redis_server)
    r <- foreach (i = c(226, 229, 230, 246), .errorhandling = 'pass', .verbose = TRUE) %dopar% {
       ... really long code (running JAGS using runjags package)
    }
    removeQueue('jobs') # to clean up the queue
    registerDoRedis('jobs', redis_server) # so that worker doesn't stop looping

This is the console output of the verbose foreach loop:

numValues: 4, numResults: 0, stopped: TRUE
automatically exporting the following objects from the local environment:
bird_dataset_nick, counts, envir, formulas, pop_model, read_poi, species, var_nick
Warning in e$fun(obj, substitute(ex), parent.frame(), e$data) :
Worker fault, resubmitting task 1.
got results for task 1
numValues: 4, numResults: 1, stopped: TRUE
returning status FALSE
got results for task 1
numValues: 4, numResults: 2, stopped: TRUE
returning status FALSE
got results for task 2
numValues: 4, numResults: 3, stopped: TRUE
returning status FALSE
got results for task 3
numValues: 4, numResults: 4, stopped: TRUE
calling combine function
evaluating call object to combine results:
fun(accum, result.1, result.2, result.3)
returning status TRUE

I use R 3.1.0, doRedis 1.1.1, rredis 1.6.9 and Redis server 2.6.12, all on a single Windows XP host.

I don't know how to reproduce the issue - it happens only sometimes.

EDIT: another case:

Output of foreach on the master:

numValues: 4, numResults: 0, stopped: TRUE
automatically exporting the following objects from the local environment:
bird_dataset_nick, counts, data, env_formula, envir, euring, f_ind, formula, formulas, no_sites, poi, pop_model, read_poi, sciname, species, species.use, var_nick
Warning in e$fun(obj, substitute(ex), parent.frame(), e$data) :
Worker fault, resubmitting task 2.
got results for task 2
numValues: 4, numResults: 1, stopped: TRUE
returning status FALSE
got results for task 1
numValues: 4, numResults: 2, stopped: TRUE
returning status FALSE
got results for task 3
numValues: 4, numResults: 3, stopped: TRUE
returning status FALSE
got results for task 2
numValues: 4, numResults: 4, stopped: TRUE
calling combine function
evaluating call object to combine results:
fun(accum, result.1, result.2, result.3)
returning status TRUE

Again, the results for the last task (4) were not collected. Examining the state of the Redis server shows that the result is still sitting in the results queue and can be retrieved from there:

redisKeys("*")
[1] "jobs:counter" "jobs:1.results" "jobs:workers"
redisGet("jobs:counter")
[1] "3"
redisGet("jobs:workers")
[1] "2"
r <- redisLRange("jobs:1.results", 0, -1)
save(r, file = "jobs_1.results.Rdata")
str(r)
List of 1
$ :List of 1
..$ 4:List of 7
.. ..$ ...
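Since the missing result is still sitting in Redis, it can in principle be recovered by hand from the list returned by redisLRange(). A minimal sketch of the extraction step; the nested list below is mocked to mirror the str() output above (the element names and contents are hypothetical), since on a live server `r` would come from redisLRange("jobs:1.results", 0, -1), which deserializes the stored R objects automatically:

```r
# Mock of the deserialized queue entry, shaped like the str() output above:
# a one-element list wrapping a list whose element is named "4" (the task id).
r <- list(list(`4` = list(result = "value computed by task 4")))

# The element named "4" holds the stranded result of the last task.
task4 <- r[[1]][["4"]]
str(task4)
```

This is only a workaround for inspecting a stuck job, not a fix for the underlying resubmission bug.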

I have the same problem, is there a solution or some kind of work around yet?

Yes, this should be fixed in the development version on GitHub, have you tried that?

devtools::install_github("bwlewis/doRedis")

(Be sure to install the new version of doRedis on all the computers participating in the cluster.)

Let me know if you still have problems with that version. It is hopefully very close to ready for a new CRAN release...

Yes, my test case now works with the development version! Many thanks for your quick response and generally for the doRedis package! It has helped me quite a bit!

@thomaskisler do you have a test case? Please post it! I have not been able to reproduce the issue! Thanks, Tomas

@tomastelensky Here is my code. I think the key to reproducing the error is setting a password for the redis-server. This took me quite a while to figure out, because the error stopped appearing after I made the script shorter. So you have to add something like the following to your Redis configuration file (in case you are not already doing so):

requirepass ThePasswordIsImportantForReproduction

My master R script, which submits 8 tasks to the queue, looks like this:

library(foreach)
library(doRedis)
library(uuid)

options('redis:num'=TRUE) #fixing the "invalid format '%.0f'; use format %s for character object" bug -> http://stackoverflow.com/questions/31939951/doredis-on-windows-7-gives-error-as-soon-as-foreach-loop-is-run

REDIS_PASSWORD = "ThePasswordIsImportantForReproduction"
REDIS_PORT = 6379

registerDoRedis("jobs",password=REDIS_PASSWORD,port=REDIS_PORT)

tmpDirectory = '/tmp/' #tempdir()
setwd(tmpDirectory)

print(paste("Saving stuff to",tmpDirectory))

taskDF <- data.frame(id=seq(1,8,by=1))

foreach(j = 1:dim(taskDF)[1], 
        .export=c("tmpDirectory","taskDF"),
        .packages = c("uuid")
        ) %dopar% {
  #get the task from the task data frame
  currTask = taskDF[j,]

  #getting something unique
  uniqPart = UUIDgenerate()

  Sys.sleep(180) #instead of the loading/processing

  # write a file for each iteration with a unique part, so every worker saves its own file(s)
  currFileToWrite = paste(tmpDirectory,"/","task-",currTask,"-completed","-",uniqPart,".csv",sep ="")
  print(paste("Writing to file:",currFileToWrite,"-",date()))
  write.csv("Dummy content", currFileToWrite, row.names = FALSE)

  return(TRUE)
}

and the worker script looks like this:

library(doRedis)

REDIS_PASSWORD = "ThePasswordIsImportantForReproduction"
REDIS_PORT = 6379

startLocalWorkers(n=2, queue="jobs",password = REDIS_PASSWORD,port = REDIS_PORT)

I hope this helps!

Thanks @thomaskisler! This is very interesting, because I did not use any Redis password when the issue came up! And it came up often - not in a predictable way, but quite often.

Just for future reference: yes, it is possible that these are two unrelated problems. In my case, the error is resolved by the new development version.

@bwlewis, in 2016 you wrote "should be fixed in the development version". What is the current status of this issue? Is the fix included in version 2.0.0?

PS: I guess you're the one who's supposed to close it according to the workflow once it's fixed - or is it waiting for me to verify? (I am not very experienced with GitHub.)