nils-braun/b2luigi

Some project submissions happen after start of the project monitoring


Originally posted by @Bilokin in #129 (comment):

The gbasf2 project submission algorithm does not submit all projects first and then wait for them to finish; rather, some project submissions happen after the start of the project monitoring. This is not optimal and we need to ensure that all gbasf2 projects have been submitted at the start of the b2luigi process. I am still not sure why this happens, but do you have an idea how to fix the issue?

Hello @Bilokin, you mentioned this issue in #129, but as far as I understand it is a separate issue, so I created a new issue to keep the discussion in #129 free of unrelated topics.

At the moment I don't understand what the problem is, so I would kindly ask you to clarify it.

some project submissions happen after start of the project monitoring

Do you mean we should not monitor any projects until all projects have been submitted? By monitoring I mean regularly calling the get_job_status method in b2luigi to check the project status. Do I understand correctly? Why is this not optimal? Maybe because the communication with the gbasf2/DIRAC servers is a bottleneck and checking a job status can slow down submission? I'm guessing, because you didn't clarify.

Each gbasf2 project submitted with b2luigi is handled as a separate task and the tasks are handled independently of each other, with one worker per task. Once a task gets assigned a worker, it is submitted and then monitored; usually it has no information about other tasks/workers. Still, sometimes we want to hold back the submission of tasks, for example if the user limits the number of workers: with workers=1, only one project will get processed at a time, as in the sketch below.
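For illustration, a minimal sketch of how the worker limit is passed to b2luigi.process (MyGbasf2Task is a hypothetical placeholder; a real gbasf2 task would be configured with the gbasf2 batch settings):

    import b2luigi

    class MyGbasf2Task(b2luigi.Task):
        # hypothetical placeholder for a task wrapping one gbasf2 project
        def output(self):
            yield self.add_to_output("result.root")

        def run(self):
            # in a real setup, the gbasf2 batch process would do the work
            with open(self.get_output_file_name("result.root"), "w") as f:
                f.write("done")

    if __name__ == "__main__":
        # with workers=1, only one task/project is processed at a time
        b2luigi.process(MyGbasf2Task(), workers=1)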

So basically, as I understand it, we want some global state that tracks the number of tasks currently being handled by the b2luigi gbasf2 process minus the tasks which have successfully been submitted. If this quantity is zero, all tasks have been submitted. We would then tell the b2luigi.process logic to only proceed with task monitoring once this is zero, and otherwise wait. We would also have to properly handle tasks whose submission failed. This is possible, but it might become quite complex and introduce new issues. A rough sketch of such a tracker follows below.
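Just to make the idea concrete, here is a minimal sketch of such a global submission tracker; everything here is hypothetical and not part of b2luigi, and a real implementation would also need to handle failed submissions:

    import threading

    _pending_submissions = 0
    _lock = threading.Lock()

    def register_submission():
        # called when a task starts handling a gbasf2 project
        global _pending_submissions
        with _lock:
            _pending_submissions += 1

    def mark_submitted():
        # called once the project submission has succeeded
        global _pending_submissions
        with _lock:
            _pending_submissions -= 1

    def all_submitted():
        # True once every registered task has finished submitting;
        # monitoring would only proceed when this returns True
        with _lock:
            return _pending_submissions == 0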

Therefore, if your problem is the cost of the project monitoring slowing down the project submission, I would suggest trying to make the project monitoring less costly. A very simple solution would be to just update the status less often. I haven't checked the luigi documentation yet for an option to configure the checking interval, but it would e.g. also be possible to cache the result of get_job_status for, say, 1 minute. This means we could force it to just return the previous result of the function call until 1 minute has passed, and only then refresh it by calling the gbasf2 function. It might also be possible to get fancier and have a single job status cache for all gbasf2 projects that gets refreshed for all projects once in a while. The HTCondor/LSF BatchProcess implementations have a job status cache, so we could take inspiration from that and just add a minimal update interval to it. A sketch of the simple per-project caching idea is below.
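As a rough illustration of the per-project caching idea (the cache_for decorator is a hypothetical helper, not part of b2luigi; since self is part of the argument tuple, methods are cached per instance):

    import functools
    import time

    def cache_for(seconds):
        # decorator that caches a function's result per argument tuple
        # and only recomputes it after `seconds` have passed
        def decorator(func):
            cache = {}  # maps args to (timestamp, result)

            @functools.wraps(func)
            def wrapper(*args):
                now = time.monotonic()
                if args not in cache or now - cache[args][0] > seconds:
                    cache[args] = (now, func(*args))
                return cache[args][1]

            return wrapper

        return decorator

    # hypothetical usage inside the gbasf2 batch process:
    # @cache_for(60)
    # def get_job_status(self):
    #     ...  # expensive query to the gbasf2/DIRAC servers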

These are my suggestions, but at the moment I don't have time to work on it. I have only a few months remaining in my PhD and haven't started writing yet, so I can't spend my free time on b2luigi features which I don't need myself but might be useful to others. I'm happy to discuss those things here, though.

Hi, thanks for the reply. I will try a longer job status check interval during the next submission.
One problem that I might have is that I use fewer workers than would be optimal: e.g. I have 15 gbasf2 projects to submit and only 12 workers.
However, is it true that all job submission and job downloading happen in just one thread?

However, is it true that all job submission and job downloading happen in just one thread?
Yes, this is actually a feature implemented on purpose, because on other batch systems like LSF and HTCondor the user typically submits one task per job, and often has several thousand tasks/jobs running at the same time. There, we don't want to have a monitoring thread/process for each job, otherwise we might get problems due to too many processes. This is explained in the b2luigi documentation:

In other luigi batch implementations, for every running batch job you also need a running task that monitors it. On most of the systems, the maximal number of processes is limited per user, so you will not be able to run more batch jobs than this. But what do you do if you have thousands of tasks to do?

The gbasf2 batch process in b2luigi is a bit of an outlier in that it wraps gbasf2 projects, so you have one task/worker per project. 15 workers should normally be no problem if the computing load per worker is not high, though on NAF you might get a warning email that you have many processes open and shouldn't use the NAF login node for computation (even if the processes don't use up any CPU cycles).

In the code, I saw that the BatchProcess class, from which all other b2luigi batch processes including the gbasf2 one inherit, sets self.use_multiprocessing = False (see here). I don't think this is something that we can just set to True, but I never tried.

In theory we could make the gbasf2 download and submission processes non-blocking, e.g. start them with

proc = subprocess.Popen(gbasf2_submit_cmd, env=gbasf2_env)  # returns immediately, does not block

instead of subprocess.run or subprocess.call. Popen then doesn't wait for the command to finish; instead we can check the return value of proc.poll() to see whether the process has finished (a sketch of this pattern follows below). The same goes for the download. We could then instruct the gbasf2 batch process to return the running job status while the job is still submitting/downloading, so the download/submission would not bottleneck the b2luigi scheduler. However, I'm a bit worried that the gbasf2 servers can't handle many parallel submissions; for downloads there were already warnings not to do them in parallel. But this is not our problem to solve: the gbasf2 developers should allow for some parallel submissions/downloads and also improve the performance of those tasks. I already heard years ago that this should improve.
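A minimal sketch of that polling pattern, assuming gbasf2_submit_cmd and gbasf2_env are defined as in the snippet above; the status handling in the comments is purely illustrative:

    import subprocess

    # start the submission without blocking the scheduler
    proc = subprocess.Popen(gbasf2_submit_cmd, env=gbasf2_env)

    def submission_finished(proc):
        # poll() returns None while the process is still running,
        # and the return code once it has exited
        return proc.poll() is not None

    # a hypothetical get_job_status could then report the project as
    # still "running" as long as submission_finished(proc) is False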

But what we can definitely do is make the job status check less costly, ideally with a job status cache.

I think waiting with the task monitoring is not an ideal solution, and I haven't seen an argument from you for why we need it, so I think we can continue the discussion in other issues: the asynchronous submissions/downloads in #129, and for the job status cache I'll create a new issue.