This repository contains the prototype for my Master's thesis realized at Henallux for the Master's in IT Systems Architecture.
The complete Master's Thesis, titled Automated Job Submission Management for Grid Computing, can be found here.
The objective of this thesis is to create an architecture to aid in the submission, monitoring and retrieval of scientific jobs submitted to various HPC clusters regrouped within the CECI.
This proof of concept is a partial implementation of some of the features proposed within my thesis, applied to a command line utility tool.
This prototype leverages information leveraged within a database accessed entirely through a REST API. Both are self-hosted due to the limited extend of this proof of concept, however below you will find the information required if you wish to reproduce the infrastructure to run this proof of concept.
The database used is MongoDB, however any form of document database should work if you wish to adapt the API's code. Same is applicable for any type of database, although more work will be required to adapt the code.
For simplicity, the database is not protected by any authentication mechanisms, mainly because it was accessed only by the API as it's backend, while the API's frontend is protected. For this prototype, the MongoDB was hosted on a Ubuntu Server in a Docker container using the following command:
docker run -d -p 27017:27017 -v ~/mongodb:/data/db mongo
A database named masi_tfe
was used with the following collections/tables:
- clusters: containing an inventory of the clusters (in this project, those present in the CECI)
- jobs: where each job submission was submitted
All of the database access code for the API server is located in the api_srv/dao
folder, with a
config.ini
file containing the database access values.
The following is an example item for the Lemaitre3 cluster hosted at the "Université catholique de Louvain".
{
"_id" : ObjectId("5ccd9ade865e7214bb340e7a"),
"name" : "lemaitre3",
"hostname" : "lemaitre3.cism.ucl.ac.be",
"port" : 22,
"workload_manager" : "slurm",
"preferred_jobs" : [ "mpi" ],
"partition" : "batch",
"status" : {}
}
The name
variable is used as a more user-friendly identifier for the user-input. The
workload_ manager
is by workload_manager.py
in order to dynamically know how to interact
with the cluster's workload manager. In the case of this prototype, Slurm was the only
load manager to interact with. The preferred_jobs
is used to indicate which type of jobs the
cluster is best suited for. The possible values for this field used in this prototype are
mpi
, serial
and smp
. These values are used upon submission to find alternative
destination clusters if the user's original target is not available/reachable. The
partition
field is a data proper to Slurm scripting and the CECI clusters (variable
depending on the cluster).
The status
value is used to store resource information regarding the cluster as a whole as well
as its independent nodes. This value is update by the wrapper module upon execution on the
destination cluster (see full thesis for more details).
The following is an example of job submission after its completion.
{
"_id" : ObjectId("5ccf3e094a80d939aee76f29"),
"job_uuid" : "503d3f77-a21b-4e52-86c4-eec0f0cd9c16",
"user_input" : {
"username" : "glongree",
"requirements" : [ "chmod +x job" ],
"user_mail" : "gaetanlongree@gmail.com",
"job" : "https://github.com/GaetanLongree/test.git",
"destination_cluster" : "lemaitre3",
"kwargs" : {
"key1" : "value1"
},
"output" : [ "result.txt" ],
"execution" : [ "srun ./job" ],
"resources" : {
"duration" : "10m",
"memory_per_thread" : "100M"
}
},
"destination_cluster" : {
"partition" : "batch",
"name" : "lemaitre3",
"hostname" : "lemaitre3.cism.ucl.ac.be",
"preferred_jobs" : [ "mpi" ],
"workload_manager" : "slurm",
"port" : 22
},
"modules" : null,
"online_job_file" : true,
"tar_file" : null,
"job_directory" : "/home/users/g/l/glongree/503d3f77-a21b-4e52-86c4-eec0f0cd9c16/",
"job_status" : {
"ReqTRES" : [ "cpu=1,mem=100M,node=1" ],
"WorkDir" : [ "/home/users/g/l/glongree/503d3f77-a21b-4e52-86c4-eec0f0cd9c16/test" ],
"NodeList" : [ "lm3-w080" ],
"NCPUS" : [ "1" ],
"CPUTime" : [ "00:00:17" ],
"JobID" : [ "67873032" ],
"State" : [ "COMPLETED" ],
"NNodes" : [ "1" ],
"AllocTRES" : [ "cpu=1,mem=100M,node=1,billing=1" ],
"Elapsed" : [ "00:08:20" ],
"ExitCode" : [ "0:0" ]
},
"working_directory" : "/home/users/g/l/glongree/503d3f77-a21b-4e52-86c4-eec0f0cd9c16/test"
}
The job_uuid
is the value generated upon submission and used by the user to track the job status.
The user_input
values corresponds to the data submitted by the user (explained in later section).
The destination_cluster
contains the values of the cluster on which the job is being ran. The
modules
field contains information relating to the external modules executed, at which stage
and what were their output (described later as well). The online_job_file
is used to know whether
the job submitted is purely stored online or if it is a local file that needs to be transferred to
the cluster. The tar_file
is used in case the job files are local but are multiple and are part
of a directory (the folder is compressed into a TAR file to optimize transfers). The job_directory
and working_directory
represent, respectively, the base location of the job folder and the
location where the job proper is being ran by the workload manager. The job_status
is the
information gathered regarding the job and is obtained by the workload_manager.py
module based
on the underlying workload manager used by the cluster (again, Slurm in this prototype's use case).
The rest API is self-hosted at my own domain, and is also ran on an Ubuntu server in a docker
container. This docker container is built from the Dockerfile
located in the root of this
repository. Due to the nature of Flask and the IP binding, the commands to build and run the
container are done on the using the host's network. While this is not a recommended practice,
the API was very accessory to the prototype, hence it did not require to operate under the best
practices, it merely required to be working.
The commands to build and run the container are as follows:
docker build --no-cache --network=host -t tfe_api .
docker run -p 7676:7676 --network host -d --name tfe_api_srv tfe_api
The Flask server did not use an SSL certificate, as it was behind a reverse proxy using HTTPS
transparently. The API server does use a simple Basic Authentication merely as basic protection
since the API was publicly accessible. To change the authentication username and password,
edit the values in api_srv/tools/authenticator.py
.
The endpoints used are as follows:
Endpoint | Method | Details |
---|---|---|
/clusters | GET | Used to get all the clusters and their details. |
/clusters | PUT | Updates the status of a cluster. Requires the name of the cluster in the payload to identify the target cluster, and status field containing the entire values to update. |
/jobs/<job_id> | GET | Return the entire values stored for a given job UUID. |
/jobs/<job_id> | POST | Used to create a new job. |
/jobs/<job_id> | PUT | Used to update a current job ID. The payload's content will update the existing fields without overwriting the entire job. |
/jobs/<job_id>/state | GET | Returns an array of the job's current state (array is used for parallel jobs). |
/jobs/<job_id>/state | PUT | Used to instruct the API server to send a mail to the job's owner regarding an update in the job's current state. |
/jobs/<job_id>/notify | PUT | Used to instruct the API server to send a mail to the job's owner regarding an error that has occurred along with the error message. |
The mail service used for the /jobs/<job_id>/state
and the /jobs/<job_id>/notify
endpoints
can be modified to use another mail provider by editing the values in the api_svr/tools/mailer.py
file.
For the service and the prototype to work you must edit some values in three separate files:
- In the API server files:
- Edit the
api_srv/__main__.py
file to change the SERVER_IP and SERVER_PORT values
- Edit the
- In the Project files:
- Edit the
project/tools/api_handler.py
file to change the URI and AUTH values - Edit the
project/tools/wrapper/api_communicator.py
file to change the URI and AUTH values
- Edit the
There are three possible use cases to use this prototype:
- Submitting a job
- Submitting a job along with external modules to execute
- Retrieving a already submitted job
To submit a job, from this repository's root folder, run the following command:
python -m project -u <path/to/user/input/file>
To submit a job along with a list of modules to execute (from this repository's root folder still):
python -m project -u <path/to/user/input/file> -m <path/to/modules/file>
Depending on the user input files, you may be given various warnings about required or optional values missing from the input file.
You will also be asked to enter the passphrase for the private key file you have provided before the connection is made.
Lastly, if the job submission was successful, you will be given a UUID specific to your job.
If the API server and mailer.py
modules have been correctly configured, you will also receive
emails whenever there is a change in your job's running state.
To retrieve a job, using the UUID given at submission, use the following command (from this repository's root folder):
python -m project -r <uuid>
The following sections describe the user input and modules specification files along with some specifics.
The user input file is a YAML file, a sample file can be found in the root of this repository, with the content presented and explained below.
username: jdoe
private_key: "path_to_key"
passphrase: "private_key_passphrase"
user_mail: johndoe@mail.com
destination_cluster: lemaitre3
requirements:
- "chmod +x job"
job: https://github.com/GaetanLongree/test.git
compilation:
- "gcc job.c"
execution:
- "srun ./job"
output:
- "result.txt"
resources:
nbr_jobs_in_array: 15
duration: 1d2h30m
nbr_processes: 5
nbr_threads_per_process: 2
memory_per_thread: 2625M
distribution: distributed
# grouped | scattered | distributed
#if 'distributed' only
nbr_of_nodes: 4
kwargs:
key1: value1
The table below reprises the
Key | Value |
---|---|
username | Your username used on the CECI clusters. |
private_key | Path to the private key file used to authenticate on the CECI clusters |
passphrase | (optional) The passphrase for the private key file. If not provided, will be asked in the command line. |
user_mail | The email address where the mail notifier will send job state updates and notifications. |
destination_cluster | The name of the cluster to which to submit the job. If the cluster is not reachable, the program will propose other reachable clusters that present the same preferred jobs as the cluster originally indicated. |
requirements | (optional) A list of commands that will be executed as is on the destination cluster. This can be used to install additional packages from source for example. |
job | Can be a file, a directory or a Git URL. If a file is given, it will be transferred to the destination cluster using SFTP. If a directoyr is given, it will be compressed in a TAR archive before being transferred, and will be unpacked before execution. If a git repository is given, it will be clones (recursively) and the working directory will be automatically set to the cloned repository's folder. |
compilation | (optional) If the job requires compilation prior to execution, enter the full command to run. Can be a single command, or a list of commands. |
execution | The list of commands to enter within the Slurm script. Commands must be complete and include the srun directive if necessary. |
output | A single value or a list of values of the files where the job's result will be stored. This value is used by the retrieval function to pull the output of the job after execution. |
resources | (optional) The resources to use for the job. These values are translated into the Slurm script. See note below. |
kwargs | (optional) A list of keyword arguments that can be used by external modules. The entire user input will be passed to each external module in order for them to make use of this information if needed. |
The resources are not mandatory, as it is assumed that the user may make use of external modules to determine the resources.
The resource values such as duration
and memory_per_thread
are translated from human
readable form into the workload managers expected values. The supported values are as follows are
d
, h
, m
and s
for the duration
, and G
and M
for the memory_per_thread
.
Regarding the distribution
key, this can take 3 values:
grouped
scattered
distributed
If the value is distributed
, a nbr_of_nodes
is required to tell the Slurm workload manager how
to distribute the job across the nodes.
This project is based around the ability to execute external modules in addition to submitting job. The modules are executed at three stages: staging, executed locally before connecting to the cluster, preprocessing, after connecting to the cluster and before the job submission on the cluster, and finally postprocessing, after the job has executed and completed successfully.
The modular execution supports any language as long as the compiler and/or interpreter is present.
Despite this fact, it is not impossible to execute modules where the interpreter is not present,
but it will require manual installing through the requirements
section of the user input values.
A sample modules_sample.yml
file is present in the root folder of this repository, with the
content presented and explained below.
staging:
1:
module: test.c
compilation: gcc -o test test.c
execution: ./test
2:
module: resource_estimator.py
execution: python resource_estimator.py
preprocessing:
1:
module: remote_test.c
compilation: gcc -o remote_test remote_test.c
execution: ./remote_test
2:
module: resource_estimator.py
execution: python resource_estimator.py
postprocessing:
1:
module: test.c
compilation: gcc -o test test.c
execution: ./test
2:
module: remote_test.py
execution: python remote_test.py
The top three stages are represented, each containing numbered key/value pairs of objects. These numbered keys act as indexes to determine the order of execution.
The module
directive is used to instruct which module is to be executed, and therefore transferred
to the destination cluster. The project will look for modules in the project/modules
folder,
which is where you must place the module you which to use.
The compilation
directive is optinal and is used if the module requires compilation. This command must be
entered exactly as it would be entered on the destination cluster, including the name of module
itself. This was done in order to account for future-proofing any modules with specific compilation
commands.(Note: the original idea was to omit the need to specify the module file.)
The execution
directive is used to specify how to execute the module. The command must be written
exactly as it is to be executed on the remote system.
When a module is executed, it will be automatically be passed as argument the user input key/pair structure as it is at this stage of the execution.
By the end of the modules execution, it must output through the regular STDOUT the same key/pair structure, either modified internally or unmodified if it has not been altered.
This mechanism is implemented in order for the modules to have direct interaction with the user input and potentially alter them, while allowing subsequent modules to have access to the information performed.