pygridtools/gridmap

Potential extension to the library

sandeepklr opened this issue · 2 comments

Hello...

Edit: I just realized that there is a related issue already opened: Add support for limiting the number of concurrently executing jobs #27. I believe this is also what I am trying to address with my approach below.

I am trying to modify gridmap's behavior. I have a bunch of jobs (~10,000 or so) that I have to run to completion. I don't care about combining the results from the individual jobs; I just need each self-contained job to run to completion, and I would like to build on top of your infrastructure to achieve this. Also, I don't want to submit all ~10,000 jobs to Grid Engine at once. Instead, I want to submit them in chunks, essentially maintaining a pool of processes (a queue would probably be a better term). When one process finishes, whether by completing or by raising an exception, it is removed from the job queue and one or more new processes are added to the pool, depending on the number of free spots left.

As I see it, this would require at least two changes to JobMonitor:

  1. Prevent JobMonitor from exiting when it encounters an error in check_if_alive(). This can be achieved by overriding check_if_alive(), calling super().check_if_alive() in the override, and catching the exception it may throw when a subprocess on a compute node raises one:
    from gridmap import job

    class JobMonitorExt(job.JobMonitor):
        def check_if_alive(self):
            try:
                super().check_if_alive()
            except Exception:
                # A job failed on a compute node; keep monitoring the others.
                pass
  2. Batching jobs: This can be done by overriding all_jobs_done() so that, instead of checking whether all jobs are done, it checks whether at least one process has finished (either with an exception or by completing). If so, remove that process from the queue and add a new one to the current session by calling _append_job_to_session(). all_jobs_done() returns True only once all ~10,000 jobs have been processed.
    class JobMonitorExt(job.JobMonitor):
        def all_jobs_done(self):
            # Check whether some job is done (exception or completed). If so,
            # remove that job from the queue and add a new one.
            # Return True only if there are no more jobs to process.
            raise NotImplementedError
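
The batching logic in the two steps above can be sketched independently of gridmap. This is only a rough illustration, not gridmap's API: `BoundedPool`, `submit`, and `poll` are hypothetical names, and the caller would supply the functions that actually talk to the grid. A failed job is retired the same way as a completed one, so one failure doesn't stop the rest.

```python
from collections import deque


class BoundedPool:
    """Keep at most `max_running` jobs in flight, refilling from a local queue.

    Hypothetical helper, not part of gridmap. `submit(job)` starts a job and
    `poll(job)` returns None while it runs, or "done"/"failed" once it stops.
    """

    def __init__(self, jobs, max_running, submit, poll):
        self.pending = deque(jobs)
        self.running = []
        self.finished = {}  # job -> "done" or "failed"
        self.max_running = max_running
        self._submit = submit
        self._poll = poll
        self._fill()

    def _fill(self):
        # Top the pool up to the limit from the local queue.
        while self.pending and len(self.running) < self.max_running:
            job = self.pending.popleft()
            self._submit(job)
            self.running.append(job)

    def all_jobs_done(self):
        # Retire every job that has stopped, whether it completed or failed.
        still_running = []
        for job in self.running:
            status = self._poll(job)
            if status is None:
                still_running.append(job)
            else:
                self.finished[job] = status
        self.running = still_running
        # Use the freed slots for jobs that are still waiting locally.
        self._fill()
        return not self.running and not self.pending
```

Calling `all_jobs_done()` once per pass of the monitoring loop would keep the number of submitted jobs at or below `max_running` until the local queue is drained.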

Do you think this will work? I'd appreciate your thoughts on this.

> Prevent JobMonitor from exiting when it encounters an error in check_if_alive().

Have you ever actually encountered an exception with JobMonitor.check_if_alive()? I can't say I've seen one. The JobMonitor runs on the machine that all the jobs get submitted from, not any of the worker nodes. See this wiki doc for details of how things work under the hood.

> Batching jobs: This can be done by overriding all_jobs_done() and instead of checking if all jobs are done, just check if at least one process has finished (either exception or completed). If there is a process like this, remove it from the process queue and add a new function process to the current session by calling _append_job_to_session().

The thought of having our own internal job queue never occurred to me, since that's supposed to be the whole point of using something like Grid Engine in the first place, but I can see how that would solve the batching problem we've been discussing for a while in #27. That said, I think just simply placing the jobs on hold after they get submitted to Grid Engine would be the better user experience, since they'd get to see the jobs as suspended/held in qstat.
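
The hold-based alternative could look roughly like the sketch below. It rests on assumptions: every job has already been submitted in a held state (Grid Engine's `qsub -h` does this), and `release` is a hypothetical callback that would wrap something like `qrls <job_id>`. The function name and parameters are made up for illustration; nothing here is existing gridmap code.

```python
def release_held_jobs(held, running_count, max_running, release):
    """Release held jobs until `max_running` are active.

    `held` is a list of job IDs still on hold (oldest first); it is
    modified in place. `release` is a caller-supplied function, for
    example a wrapper around `qrls`. Returns the IDs released by this call.
    """
    free_slots = max(0, max_running - running_count)
    released = held[:free_slots]
    del held[:free_slots]
    for job_id in released:
        release(job_id)
    return released
```

The monitor would call this each time it notices that a job has finished, so the remaining jobs stay visible as held in qstat until a slot frees up.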

I'd love the help with GridMap, so if you put together a PR that uses either approach, I'd gladly review it.

> Have you ever actually encountered an exception with
> JobMonitor.check_if_alive()?

This is the exception that is propagated when a child process on a compute
node throws one and terminates, in which case we exit the event loop as
well and kill the other child processes on the way out because of the
JobMonitor context. Instead, for my requirement, I'd rather let the other
child processes complete. By overriding JobMonitor.check_if_alive(), and
wrapping it in a try/except block, I can let the event loop continue even
after a child process on a compute node terminates due to exceptional
circumstances.
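
To make that concrete, here is a small self-contained model of the behavior. `FakeMonitor` is a stand-in made up for illustration: it only mimics the one property that matters here (raising from `check_if_alive()` when a job has failed) and is not how gridmap's JobMonitor is actually implemented. `TolerantMonitor` is likewise a hypothetical name.

```python
class FakeMonitor:
    """Stand-in for a monitor whose check raises when a job has failed."""

    def __init__(self, statuses):
        # statuses maps job name -> "running", "done", or "failed"
        self.statuses = statuses

    def check_if_alive(self):
        for name, status in self.statuses.items():
            if status == "failed":
                raise RuntimeError(f"job {name} failed")


class TolerantMonitor(FakeMonitor):
    """Record failures and keep going instead of aborting the loop."""

    def __init__(self, statuses):
        super().__init__(statuses)
        self.failures = []

    def check_if_alive(self):
        try:
            super().check_if_alive()
        except RuntimeError:
            # Remember which jobs failed and mark them as handled so the
            # next check does not raise for the same jobs again.
            for name, status in self.statuses.items():
                if status == "failed":
                    self.failures.append(name)
                    self.statuses[name] = "handled"
```

Recording the failures, rather than silently passing, makes it possible to report or resubmit the failed jobs after the loop has finished.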

I will be modifying the library code because I need these changes to run my
jobs. I have ~10K self-contained jobs, so I do want them to run to
completion even if some of them terminate due to exception conditions and
I'd like to batch the jobs up in my own local queue as opposed to dumping
10K jobs on the SGE queue at once.

What's a good way to go about architecting this from a code standpoint?
Modifying the guts of your code or extending the JobMonitor class to add
this functionality?