This is a simple recipe for launching jobs on a cluster.
It is mostly meant for clusters that lack proper queueing systems (such as the vision cluster at MIT).
- Install tmux
- Add this line to your
.tmux.conf
:bind-key y set-window-option synchronize-panes
- Put
crunch.sh
into your path - Modify
crunch.sh
to change .csail.mit.edu domains to your domain
- Start tmux
- Run command:
crunch N machine1 machine2 machine3
whereN
is the number of replicas you want, and the rest are the hostnames of the machines you want to connect. This will cause tmux to open N*number_of_hosts panes, with SSH connections into each machine. - Press
Ctrl+b y
to synchronize all the panes. - Start typing, and your input will be broadcast to all machines.
If you write your scripts in the right way, you can use the above workflow to process data in parallel. Below is a psuedo-code that shows the basic idea. It assumes you are using a networked file system (such as NFS).
workload = ["file1.json", "file2.json", ...]
random.shuffle(workload) # in matlab, you must remember to set the random seed
for workitem in workload:
in_path = "input/" + workitem
out_path = "output/" + workitem
lock_path = out_path + ".lock"
if exist(out_path) or exist(lock_path):
continue
mkdir(lock_path) # create lock file
# do heavy lifting here that eventually writes out_path
rmdir(lock_path) # remove lock file
Remember, everytime you connect to a CSAIL machine, you must authenticate, otherwise your jobs will be killed eventually. To get around this, you can start your session with longscreen
, and start tmux inside the longscreen.