/examples-pbs-dgx

Example scripts to use DGX nodes with PBS scheduler

Primary LanguageShell

The first three test inputs are designed to gave the basic information required to use the DGX nodes effectively.
The rest of the examples cover more advanced topics.

The project code in the job script should be replaced with your project coode (for example 41000001 was the pilot project code).

Copy or "git clone" this directory to your own project or user directory.

Submit first basic example:
# examine "submit.pbs" to understand the settings, change the project code and submit:
cd 1-basic-job && \
 project="12345678" && \
 sed -i "s/-P 99999999/-P $project/" submit.pbs && \
 qsub submit.pbs

The following commands checks status of your jobs:
 qstat 

To kill a job use (replace 000000.wlm01 with job id):
 qdel 000000.wlm01

Once jobs have finished check */*.o[0-9]* to see standard output and error from jobs.
Reference output is available in the directory "reference-output" in each of the test cases.

# Repeat for the other examples:
2-mxnet-training	run a training job
3-pip-install		install a python package inside a running container
4-taskfarm              run multiple commands on a single GPU in the same job

There is also other infomation in the examples:
docker-build		building custom images
singularity		use Singularity containers instead of Docker
horovod			multi-node training
jupyter			starting a jupyter notebook
mpi			a multinode MPI example
staging			using the local SSD /raid filesystem on the nodes

Please use the job scripts as templates for your own workflows.

For general information on user enrollment, new user information, projects, PBS job scheduler, etc. then consult the user guides at https://help.nscc.sg/user-guide/

If you are having any problems running these examples then the NSCC Service Desk can be contacted by email at help@nscc.sg or by using the portal at https://servicedesk.nscc.sg/