steps to set up your compute canada resources and important guidelines
Before your start, please make sure you are familiar with the resources/terminlogies of Compute Canada available at Compute Canada Wiki
Important Warning: Do not run any intensive jobs on login nodes or directly on compute nodes. Only run jobs on compute nodes via the scheduler! Refer to the video.
- Login to CC
- Using modules
- Creating and using a virtual environment
- Installing Python modules
- Transfer data
- Interactive jobs
- Job scheduling
when using Linux or MacOs, you will need to open a terminal and use the command shown below:
[name@server ~]$ ssh -Y userid@cluster_name
The cluster name will be something like cedar.computecanada.ca or niagara.computecanada.ca.
More information about this can be found here
The spider sub-command searches the complete tree of all modules and displays it.
[name@server ~]$ module spider
If you specify the name of an application, for example with
[name@server ~]$ module spider python
this will show you a list of all available versions of the application. The outcome will be something like
Versions:
python/2.7.14
python/3.5.4
python/3.6.3
python/3.7.0
python/3.7.4
python/3.8.0
If you specify the name of the application along with a version number, for example with
[name@server ~]$ module spider python/3.6.3
this will display a list of the modules you must load in order to access this version. However, there is no prerequisite to load Python.
The sub-command list lists the modules that are currently loaded in your environment.
[name@server ~]$ module list
The sub-command load lets you load a given module. For example,
[name@server ~]$ module load python/3.6.3
could let you access the Python application, version 3.6.3. It is recommended to use this version.
Anytime you can find the version of the ongoing Python using the command python3 --version
. Besides, if there is more than one Python installed on your personal computer, the command which python3
returns the address where the ongoing Python is stored on your personal computer. It is highly recommended not to use Anaconda distribution of Python, instead install and add the required packages to Python3 using the package manager pip3. pip3 is the Python3 package manager and we later use this tool to install software packages written in Python3. Also the command pip3 --version
returns the version of the Python3 for which the software packages are going to be installed.
More information about this can be found here
In addition to the base Python module, the SciPy package is also available as an environment module. The scipy-stack module includes:
- NumPy
- SciPy
- Matplotlib
- IPython
- pandas
- Sympy
- nose
If you want to use any of these Python modules, load a Python version, then use module load scipy-stack
. However, as an alternative way, you can install these modules with the python package manager as explained later.
More information about this can be found here
With each version of Python, Compute Canada provides the tool virtualenv. This tool allows users to create virtual environments within which you can easily install Python modules. This tool allows one to install many versions of the same module, for example, or to customize a Python installation according to the needs of a specific project. It also helps to resolve version conflicts while installing new modules by simply removing virtualenv folder and reinstalling the modules. This will take no time and keep the Python tidy and prevents chaos. Therefore, it is highly recommended to first set up a virtual environment in both your personal computer and Compute Canada account before installing Python modules.
To create a virtual environment, ensure you have selected a Python version with module load python
as shown above. Then enter the following command, where ENV
is the name of the empty directory containing your environment. It is recommended to place ENV
in the home directory on the Compute Canada cluster:
[name@server ~]$ virtualenv --no-download ~/ENV
Once the virtual environment has been created, it must be activated:
[name@server ~]$ source ~/ENV/bin/activate
You should also upgrade pip in the environment:
[name@server ~]$ pip install --no-index --upgrade pip
To exit the virtual environment, simply enter the command deactivate:
(ENV) [name@server ~] deactivate
The procedure for creating a virtual environment on your personal computer is different but simple.
More information about this can be found here
The preferred option is to install it using the python package manager pip3 as follows:
- Load a Python module, as shown above
- Create and start a virtual environment.
- Install PyTorch in the virtual environment with pip3 install.
(ENV) [name@server ~] pip3 install --no-index torch torchvision
More information about this can be found here
The preferred option is to install it using the Python wheel as follows:
- Load a Python module, as shown above
- Create and start a virtual environment.
- Install TensorFlow in the virtual environment with pip3 install.
(ENV) [name@server ~] pip3 install --no-index tensorflow_cpu
(ENV) [name@server ~]$ pip3 install --no-index tensorflow_gpu
More information about this can be found here
tensorboardX is the indespensible module installed along with deep learning libraries to visualize the results such as training or validation carves.
Install tensorboardX in the virtual environment with pip3 install:
(ENV) [name@server ~] pip3 install tensorboardX
tensorboardX should be installed both on your personal computer and Compute Canada account. After running the Python script, you need to first download the log files created by the Python script from Compute Canada to your personal computer and then visualize the results using the command tensorboard --logdir logs
. logs is the folder address where you store the log files.
More information about this can be found here
Matplotlib is a plotting library for the Python programming language. Matplotlib can either be loaded as part of SciPy stack as shown above or installed with pip3 install in the virtual environment. Because Compute Canada doest not let the Python script get access to the display server you need to save the figures created by matplotlib as picture files, then download them to your personal computer to see the results.
IPython (Interactive Python) is a command shell for interactive computing, originally developed for the Python programming language. IPython can either be loaded as part of SciPy stack as shown above or installed with pip3 install in the virtual environment.
It is recommended to Use data transfer nodes, also called data mover nodes, instead of login nodes whenever you are transferring data to and from Compute Canada clusters. If a data transfer node is available, its URL will be given near the top of the main page for each cluster: Béluga, Cedar, Graham, Niagara.
As Compute Canada restricts the volume of data you can upload to the root directory, it is highly recommended to store your datasets and scripts in the /home/userid/scratch
folder, and use the root directory merely to set up your virtual environment as mentioned above.
The commands scp
can be used in a command-line environment on Linux or Mac computers to transfer data to and from Compute Canada:
scp -r source_address destination_address
For example, if your destination folder is located on the graham cluster, the destination address would be either userid@graham.computecanada.ca:/home/userid/scratch/destination_folder
or userid@gra-dtn1.computecanada.ca:/home/userid/scratch/destination_folder
. The later is recommended because it uses the transfer node instead of login node.
More information about this can be found here
More conveniently, you can transfer data to and from Compute Canada with FileZilla software. Install the software with sudo apt-get install filezilla
and then set up a connection with the desired cluster as follows:
-
file menu --> site manager
-
my site --> new site: choose a name, e.g., grahamCC
-
general tab:
- protocol: sftp
- host: garahm.computecanada.ca or gra-dtn1.computecanada.ca (recommended)
- logon type: normal
- user and pass
-
ok
-
connect
Though batch submission is the most common and most efficient way to take advantage of the clusters, interactive jobs are also supported. These can be useful for things like:
- Data exploration at the command line
- Interactive "console tools" like R and iPython
- Significant software development, debugging, or compiling
You can request resources from Compute Canada and start an interactive session on compute nodes with salloc:
salloc --time=1:0:0 --gres=gpu:1 --cpus-per-tasks=16 --mem=63500M --ntasks=1 --account=def-wkui
In this example, we requested the resources for 1 hour. You should not occupy the resources in interactive mode persistently for a long time (e.g, one day), otherwise you are likely to receive a warning from Compute Canada.
After getting the required resources, load a Python module as shown above, activate already created virtual environment as shown above, then run the Python script with the command pthon3 script_name.py
. Because we do not have access to IDE, you need to use the Python Debugger module to debug the script. To this end, use the command import pdb: pdb.set_trace()
within the script wherever you want to pause the program. This command plays the role of breakpoint in IDEs. More information about how to step in and step out with pdb
can be found here. For the training purpose and working with gpu units, the commands ipython
or pyhton3
in the virtual environment provides you with an interactive tool to write and run your code interactively.
More information about this can be found here
On Compute Canada clusters, the job scheduler is the Slurm Workload Manager. This section provides guidance on submitting jobs to Compute Canada clusters. The command to submit a job is sbatch:
sbatch job_name.sh
It returns a job ID that you can later use to monitor or cancel the job. It is noteworthy that you do not have to load Python or activate the virtual environment from terminal while submitting a job.
For each Python script/project to be scheduled for running, a Slurm job script is created. The job script allows to start, execute, and monitor the Python script (typically on parallel) on compute nodes. A Slurm job script looks like this:
#!/bin/bash
# ---------------------------------------------------------------------
# SLURM script for a multi-step job on a Compute Canada cluster.
# ---------------------------------------------------------------------
#SBATCH --account=def-wkui
#SBATCH --gres=gpu:1 # Request GPU "generic resources"
#SBATCH --cpus-per-task=16 # Cores proportional to GPUs: 6 on Cedar, 16 on Graham.
#SBATCH --mem=63500M # Memory proportional to GPUs: 31500 Cedar, 63500 Graham.
#SBATCH --time=03:00:00
#SBATCH --job-name=some_name
#SBATCH --output=/scratch/destination_folder/some_name-%j.out
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
echo ""
echo "Job Array ID / Job ID: $SLURM_ARRAY_JOB_ID / $SLURM_JOB_ID"
echo "This is job $SLURM_ARRAY_TASK_ID out of $SLURM_ARRAY_TASK_COUNT jobs."
echo ""
# ---------------------------------------------------------------------
# Run your simulation step here...
module load python/3.6.3
source ~/ENV/bin/activate
python ~/scratch/destination_folder/python_script_name.py
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
It is mainly made up of two parts. Within the first part, we request the desired resources, and the second part consists of commands that we want the compute node to execute. After running the Python script, besides log files created by tensorboardX and picture files created by matplotlib, a text file known as the output file will be created that reports the running process and records the results of I/O commands such as print().
It is noteworthy that you rarely need more than one gpu as it is less likely to become saturated during the training process. In case you want to further decrease the running time of your script, request enough memory and increase the batch_size for each iteration. Otherwise, you need to make some changes to your script to fully carry out the task parallelism and also you need to wait further in the queue to get the resources.
More information about this can be found here
Niagara accepts jobs of up to 24 hours run-time, Béluga up to 7 days, and Cedar and Graham up to 28 days.
On the three general-purpose clusters, longer jobs are restricted to use only a fraction of the cluster. There are partitions for jobs of
- 3 hours or less,
- 12 hours or less,
- 24 hours (1 day) or less,
- 72 hours (3 days) or less,
- 7 days or less, and
- 28 days or less
Because any job of 3 hours is also less than 12 hours, 24 hours, and so on, shorter jobs can always run in partitions with longer time-limits. A shorter job will have more scheduling opportunities than an otherwise-identical longer job.
More information about this can be found here
For the program that is estimated to take too long to run, therefore, it is highly recommended to save and resume the state parameters and training index on a regular basis. This lets you resubmit the job several times with shorter running intervals to avoid waiting too long in the queue. To this end, this repository provides a great asset.
Each cluster of Compute Canada has its own node characteristics such as the amount of memory or cpu cores available per gpu unit. This information can be found on cluster's page on Compute Canada Wiki. Fortunately, they are concisely summarized in this python script.
Use scancel with the job ID, that you have already received when submitted the job, to cancel a job:
$ scancel <jobid>
You can also use it to cancel all your jobs, or all your pending jobs:
$ scancel -u $USER
$ scancel -t PENDING -u $USER
By default squeue will show all the jobs the scheduler is managing at the moment. It will run much faster if you ask only about your own jobs with
$ squeue -u $USER
You can also use the utility sq to do the same thing with less typing.
You can show only running jobs, or only pending jobs:
$ squeue -u <username> -t RUNNING
$ squeue -u <username> -t PENDING