Submit Matlab jobs to HTCondor from Matlab

htcondor-matlab is a set of Matlab functions to interface with the HTCondor high-throughput computing software framework, to submit Matlab functions as jobs.

It is assumed that the HTCondor machines share a filesystem and that all machines have access to the resources necessary to run the jobs, including an identical installation of Matlab. Because the functions call HTCondor commands, they have to be run on one of the HTCondor machines.

Installation

Put the htcondor-matlab functions into a directory on the Matlab path. Then copy condorConfig_template.m to condorConfig.m in the same directory and edit the copy. At a minimum, adjust the value of conDir to point to an existing and writable directory, the htcondor-matlab cluster directory, which has to be accessible from all HTCondor machines.

Example

The code for creating and submitting a cluster of jobs has the following form:

clusterHandle = condorCreateCluster;
for i = 1 : 16
    condorAddJob(clusterHandle, @exampleJob, {i}, 1)
end
condorSubmitCluster(clusterHandle)

In this example, the resulting cluster consists of 16 jobs, where each job runs exampleJob(i) with values of i from 1 to 16. The job function used here is included as exampleJob.m with htcondor-matlab; it takes a number as argument and returns its square.

Usage

A cluster is created by

clusterHandle = condorCreateCluster(description);

The cluster is assigned a handle, which is a string of the form cluster# where # is a sequential number starting from 0. It is used to identify the cluster to all other functions. The cluster can be given a descriptive label, but one is automatically generated if the argument is omitted.

A job is added to a cluster by

condorAddJob(clusterHandle, jobFun, argIn, numArgOut)

jobFun is the function handle of the Matlab job function; it can reference an m-file (including private) as well as an anonymous, local, or nested function. argIn is a cell array containing the arguments to be passed to the job function, and numArgOut is the number of its output arguments.

A cluster of jobs is submitted to HTCondor by

condorSubmitCluster(clusterHandle)

A cluster can be resubmitted with the same syntax, in case one or more of its jobs failed. Suitable jobs (neither still running nor completed successfully) are automatically identified and only they are resubmitted. If 'debug' is given as a second argument to condorSubmitCluster, jobs are not submitted to HTCondor but executed locally and sequentially, to facilitate finding programming errors.
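As a minimal sketch of the debug mode described above, the following creates a one-job cluster and runs it locally instead of submitting it to HTCondor. It assumes htcondor-matlab is installed and configured; the label 'debug run' and the anonymous job function are illustrative choices, not part of the package.

```matlab
% Create a small cluster with a descriptive label (label is illustrative).
clusterHandle = condorCreateCluster('debug run');

% Add one job: an anonymous function with one input and one output.
condorAddJob(clusterHandle, @(x) x ^ 2, {3}, 1)

% With 'debug' as second argument, the job is executed locally and
% sequentially rather than being submitted to HTCondor.
condorSubmitCluster(clusterHandle, 'debug')
```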

After submission, the progress of a cluster's jobs can be monitored using

condorMonitorCluster(clusterHandle)

This function scans output, error and HTCondor log files of all jobs and prints overview information at regular intervals. It assumes a specific form of the output generated by the job function:

primary message 1
  secondary message 1
  secondary message 2
primary message 2
  secondary message 3
  secondary message 4

That is, a line with no leading whitespace is considered a ‘primary message’, a line with leading whitespace a ‘secondary message’. This way, information about larger processing units in the job can be separated from information that tracks progress within these units, giving a more fine-grained overview.
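A job function that follows this output convention could look like the sketch below. The name progressJob and its internals are hypothetical and not part of htcondor-matlab; the only thing taken from the description above is that primary messages start at the beginning of the line and secondary messages start with whitespace.

```matlab
function result = progressJob(n)
% PROGRESSJOB  Hypothetical job function illustrating the message format.
%   Processes n blocks and returns the squares of the block numbers.

result = zeros(1, n);
for block = 1 : n
    % primary message: no leading whitespace
    fprintf('processing block %d of %d\n', block, n)
    for step = 1 : 3
        % secondary message: leading whitespace
        fprintf('  step %d of 3\n', step)
    end
    result(block) = block ^ 2;
end
```

Saved as progressJob.m on the Matlab path, such a function would be added with condorAddJob(clusterHandle, @progressJob, {n}, 1).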

The output of condorMonitorCluster has tabular form with the following structure:
– The 1st column shows the job number ###, starting from 000.
– The 2nd column shows the last primary message.
– The 3rd column shows the last secondary message since the last primary message.
– The 4th column shows status symbols. Jobs that have Matlab error messages are marked with ‘∗’. Jobs that exited successfully are marked with ‘+’, jobs that exited with an error with ‘-’, and jobs that crashed with ‘~’. The HTCondor job status is indicated by one of the letters ‘I’ = idle, ‘R’ = running, ‘X’ = removed, ‘C’ = completed, or ‘H’ = on hold.
– The 5th column shows the HTCondor job identifier in the form ClusterId.ProcId.
– The 6th column shows the last event from the HTCondor log (excluding ‘image size updated’).

The information is presented as text in the Command Window, or the terminal window if Matlab is used without GUI. This has the advantage that condorMonitorCluster can also be used under an ssh login.

An error during job execution can be diagnosed by inspecting the output, error, and HTCondor log of a job using

condorInspect(clusterHandle, jobNumber)

The return values of the jobs in a cluster can be retrieved by

results = condorGetResults(clusterHandle);

results is a cell array with one element per job. If a job exited successfully, the corresponding element is a cell array containing the return value(s) of that job. If a job did not (yet) exit successfully, the element is an empty array. Instead of or in addition to returning values, job functions can of course also write their results directly to files.
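The nested structure of results can be unpacked with cellfun, as in the following sketch. To keep it self-contained, results is a hand-written stand-in with the structure described above (jobs with one return value each, one job not yet finished); in real use it would come from condorGetResults.

```matlab
% Stand-in for: results = condorGetResults(clusterHandle);
% Four jobs with one return value each; the second has not finished.
results = {{1}, [], {9}, {16}};

% Elements of unfinished or failed jobs are empty arrays.
finished = ~cellfun(@isempty, results);

% Extract the first return value of every finished job.
values = cellfun(@(r) r{1}, results(finished));

fprintf('%d of %d jobs have results\n', nnz(finished), numel(results))
```

This form of cellfun only works if the extracted return values are scalars; for other return types, 'UniformOutput', false would be needed.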

A list of all existing clusters, including summary statistics about their jobs’ status, can be obtained by

condorClusters

It uses the same symbols as condorMonitorCluster (see above). Old clusters can be removed using condorClusters clean.

Example continued

With a probability of 50%, the job function exampleJob does not complete successfully, but throws an error. This is to simulate the fragility of job execution in real applications.

Monitor the submitted cluster until all the jobs have completed (symbol ‘C’). Most likely, some of them will have failed (symbol ‘-’). In that case, resubmit the cluster and monitor it again. Repeat this procedure until all jobs have completed successfully (symbol ‘+’). After that, the retrieved results should be a cell array of cell arrays containing the square numbers from 1 to 256.

Clusters, jobs, handles, and IDs

htcondor-matlab adopts the terminology of HTCondor: A single computation unit is called a ‘job’, and a group of jobs belonging together is called a ‘cluster’. In HTCondor, a job is also called a ‘process’ after submission. Cluster IDs are integers assigned by HTCondor in the order of submission, and process IDs are integers assigned to jobs in the order of the submit description file, starting from 0 within a cluster.

For technical reasons, the clusterHandle assigned by htcondor-matlab is not identical to HTCondor’s ClusterId. On first submission, the job number assigned by htcondor-matlab is identical to HTCondor’s ProcId, but resubmitted jobs belong to a new HTCondor cluster, with ProcIds starting from 0 again. However, condorMonitorCluster lists for each job the corresponding identifier of the form ClusterId.ProcId used by HTCondor, so that HTCondor tools such as condor_rm can easily be used alongside htcondor-matlab.

Internal data structure

In the htcondor-matlab cluster directory, for each cluster a subdirectory is created with a name identical to its handle, cluster#, which contains data to manage and run the cluster as well as the return values of completed jobs. To save disk space, it is advisable to remove old clusters from time to time (condorClusters clean).

Within each cluster subdirectory, general cluster and job management data are kept in cluster.mat. After submission, the cluster’s HTCondor submit description file is submit. Job-specific data are in files whose name begins with job###, where ### is the job number. On addition of a job, the file job###_in.m containing the job’s Matlab input script and the file job###_inf.mat with job information used by that script are created. The job’s standard output is redirected to the file job###_out and its standard error to job###_err. HTCondor log messages are written to job###_log. When finished, the return values of the job are written to job###_res.mat.
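Given this layout, the files of a job can also be located by hand, as sketched below. The directory, handle, and job number are hypothetical example values, and the three-digit zero-padded job number is inferred from the job### pattern with numbering starting at 000. condorInspect remains the intended way to look at these files.

```matlab
% Hypothetical example values; conDir is the cluster directory
% set in condorConfig.m.
conDir        = '/shared/condor';
clusterHandle = 'cluster3';
jobNumber     = 7;

% Assumed naming: 'job' followed by the zero-padded job number.
jobPrefix = sprintf('job%03d', jobNumber);

outFile = fullfile(conDir, clusterHandle, [jobPrefix '_out']);
errFile = fullfile(conDir, clusterHandle, [jobPrefix '_err']);
resFile = fullfile(conDir, clusterHandle, [jobPrefix '_res.mat']);

% Show the job's standard error output, if the file exists.
if exist(errFile, 'file')
    disp(fileread(errFile))
end
```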


This software was developed with Matlab R2013a and HTCondor 8.2.3 on Debian GNU/Linux 7.8, but may work with other versions and OSs, too. It is copyrighted © 2016 by Carsten Allefeld and released under the terms of the GNU General Public License, version 3 or later.