HenrikBengtsson/future.batchtools

Eqw on SGE cluster while R code finishes without error

Luqing-Zhang opened this issue · 0 comments

Hi,

I use future.batchtools a lot with our SGE cluster. Recently I started running into a problem: the future.batchtools code in R finishes without any error, but of the 1563 jobs submitted, a few (1 to 5, seemingly at random) end up in the Eqw state with an error like the one below (from qstat -j $jobname).

03/06/2021 17:58:23 [1506654697:17781]: error: can't open stdout output file "/pQTL/.future/20210306_161441-5LTXzg/future_lapply-72_293396299/logs/job99655decd726a355cf6b8d8746efadcf.log": No such file or directory
scheduling info: (Collecting of scheduler job information is turned off)

Inside the .future folder, about 200 job folders were not removed by future, and the specific folder named in the error message does not exist at all (I suppose future removed it automatically).
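For anyone who wants to check the same thing on their side, here is a minimal shell sketch of how the leftover folders and empty logs could be counted. It assumes each job folder holds one `logs` directory containing `*.log` files, as the path in the error message suggests. `FUTURE_DIR` and the two helper function names are hypothetical and not part of future.batchtools or SGE.

```shell
#!/bin/sh
# Sketch only: FUTURE_DIR and both helper names are hypothetical.
# Point FUTURE_DIR at the actual .future folder before running.
FUTURE_DIR="${FUTURE_DIR:-.future}"

# Count "logs" directories; assumes one per leftover job folder,
# matching the layout seen in the qstat error path.
count_log_dirs() {
  find "$1" -type d -name logs | wc -l | tr -d ' '
}

# Count zero-byte *.log files (the empty logs described above).
count_empty_logs() {
  find "$1" -type f -name '*.log' -size 0 | wc -l | tr -d ' '
}

# Only report when the directory actually exists.
if [ -d "$FUTURE_DIR" ]; then
  echo "leftover job folders: $(count_log_dirs "$FUTURE_DIR")"
  echo "empty log files:      $(count_empty_logs "$FUTURE_DIR")"
fi
```

Run it from the working directory of the R session, or set `FUTURE_DIR` explicitly, for example `FUTURE_DIR=/path/to/.future sh count_leftovers.sh` (the script name is just an example).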

In my previous experience with future.batchtools, if a job finishes correctly, its folder inside .future is removed. If the job fails, the folder is kept and an error message appears in R.
Questions:
The R code finishes without error, so why are hundreds of future job folders left inside the .future folder (of 1563 jobs, roughly 200 folders remain, and their log files are empty)? Why are they not removed even though the jobs finished correctly?

Why do a few jobs end up in Eqw on the SGE cluster with no corresponding folder? (The folder did exist when the future SGE jobs were started; it seems future considered those jobs successfully finished and removed their folders.)

Do you think this is an issue with the future.batchtools package, or one I should take to our HPC infrastructure team? Thanks very much!