By Yuxiang Chai
For more information, please check the NYU HPC official website or email hpc@nyu.edu
Note:
- In this file, `<net-id>` means your net ID; replace it with something like `bs1234`, so `/scratch/<net-id>` becomes something like `/scratch/bs1234`.
- When you see `$` in a code block, that line should be run on the command line.
- For debugging and testing, I suggest using the `srun` command to create an interactive job, and the `sbatch` command to run the full training job. Please check the sections below for more information.
- Apply for an account through your course instructor or other staff
- NYU VPN (recommended)
- VS Code (recommended)
- Connect to NYU VPN
- Open VS Code and enable the `Remote - SSH` extension
- Use the `Remote - SSH` extension to connect to `<net-id>@greene.hpc.nyu.edu`
- Open your folder (most of the time, use `/scratch/<net-id>` instead of `/home/<net-id>`; explained later)
(requires knowledge of vim and other command-line tools)
- SSH to the gateway
$ ssh <net-id>@gw.hpc.nyu.edu
- From the gateway, SSH to Greene
$ ssh <net-id>@greene.hpc.nyu.edu
OR
- Connect to NYU VPN
- SSH directly to Greene
$ ssh <net-id>@greene.hpc.nyu.edu
Basically we use three file systems:
- `/home/<net-id>`, or `$HOME`, which is for small files. Not flushed.
- `/scratch/<net-id>`, or `$SCRATCH`, which is for large files. Files not accessed for 60 days get flushed.
- `/archive/<net-id>`, or `$ARCHIVE`, which is for long-term files. Not flushed.
I recommend using `/scratch` for most projects, so you are less likely to run into disk-space problems.
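To see where these locations point for your account, you can inspect the corresponding environment variables (a small sketch; `bs1234` is a placeholder net ID, and the defaults below only exist so the snippet also runs off-cluster):

```shell
# On Greene these variables are set at login; fall back to placeholders elsewhere.
NETID=bs1234                                  # placeholder net ID
SCRATCH="${SCRATCH:-/scratch/$NETID}"         # predefined on Greene
echo "home:    ${HOME}"                       # small files, not flushed
echo "scratch: ${SCRATCH}"                    # large files, 60-day purge
echo "archive: ${ARCHIVE:-/archive/$NETID}"   # long-term files, not flushed
```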
In most cases we use HPC for GPU resources, and Greene provides two types of GPU: RTX 8000 and V100.
From my experience, the wait time for an RTX 8000 is shorter than for a V100, though the RTX 8000 may take longer to train models.
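If the card type matters, you can either take whatever is free or name a specific type in the `gres` resource specification used when submitting jobs (a sketch of the directives; see the SBATCH section below):

```shell
#SBATCH --gres=gpu:1            # any available GPU
#SBATCH --gres=gpu:rtx8000:1    # specifically one RTX 8000 (often a shorter wait)
#SBATCH --gres=gpu:v100:1       # specifically one V100
```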
I only use Python on HPC, so this guide covers only the Python environment. Basically I use two kinds of virtual environments, one from `venv` and the other from `conda`. I will introduce both of them.
`venv` is Python's built-in virtual-environment module. It's easy to use but also has some disadvantages.
Pros:
- No need to mess with conda environment.
- Easy to use when submitting a job.
- Works for most popular packages.
- Easy to install an ipykernel for Jupyter.
Cons:
- It cannot create an environment with a specific Python version; it can only use the currently loaded Python version.
- Some packages are hard to install with `pip` and may cause problems.
- The CUDA version is not changeable.
To use `venv`:
- First we need to load a Python module.
$ module purge # purge everything
$ module avail python # list all python available versions, right now only 3.8.6 is available
$ module load python/intel/3.8.6 # load a python module
- Then we can create our own virtual environment.
$ cd /scratch/<net-id> # or cd $SCRATCH
$ python -m venv myenv # create a folder (environment) under the current directory; you can replace 'myenv' with another name
$ source myenv/bin/activate # activate the environment you created
(myenv) $ python --version # now we can see myenv is activated and shown at the front
- Install packages you need.
(myenv) $ pip install numpy
- Submit a job (see the Submit a job section below)
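Before submitting, it can help to record what you installed so the environment can be rebuilt later; a small sketch using pip's standard freeze/requirements workflow (the file name `requirements.txt` is just a convention):

```shell
pip freeze > requirements.txt     # record the packages currently installed
pip install -r requirements.txt   # reinstall them into a fresh environment
```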
`conda` is a widely used environment manager on all platforms.
Pros:
- Packages more than Python packages
- Specify Python version
- Can install specific cudatoolkit version for Pytorch
Cons:
- Sometimes gets messed up due to dependency conflicts
- More steps to set up
- More steps to use when submitting a job
- More steps to install an ipykernel for Jupyter
To use `conda`:
- First we need to load anaconda module.
$ module purge # purge everything
$ module avail anaconda # list all anaconda available versions, the latest should be anaconda3/2020.07
$ module load anaconda3/2020.07 # load the module
- Then we use anaconda to create an environment.
$ conda init bash # initialize the bash
$ source ~/.bashrc # to reload the bash and make conda work
(base) $ conda create -p /scratch/<net-id>/env39 python=3.9 # we can specify python version when creating the environment, you can replace the 'env39' with other names
(base) $ conda activate /scratch/<net-id>/env39/ # activate the environment env39
(env39) $ conda list # this command will show the installed packages in env39
Note: If you encounter a `disk quota exceeded` error, run the following command.
$ conda clean -a
Three ways to run a job.
- SBATCH
- SRUN
- OOD
SBATCH is the most common way to run a job.
- First we create a file called `script.sbatch` (the name doesn't matter)
$ touch script.sbatch
- Then in the file, we need to add some sbatch commands.
For `venv`:
#!/bin/bash
#
#SBATCH --job-name=myjob
#SBATCH --output=myjob.out
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8GB
#SBATCH --time=2:00:00
#SBATCH --gres=gpu:1
#SBATCH --mail-type=END
#SBATCH --mail-user=<net-id>@nyu.edu
module purge;
cd /scratch/<net-id>;
source myenv/bin/activate;
cd project;
python train.py --optim sgd;
For `conda` (details here):
#!/bin/bash
#
#SBATCH --job-name=myjob
#SBATCH --output=myjob.out
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8GB
#SBATCH --time=2:00:00
#SBATCH --gres=gpu:1
#SBATCH --mail-type=END
#SBATCH --mail-user=<net-id>@nyu.edu
module purge;
module load anaconda3/2020.07;
source /share/apps/anaconda3/2020.07/etc/profile.d/conda.sh;
conda activate /scratch/<net-id>/env39/;
export PATH=/scratch/<net-id>/env39/bin:$PATH;
cd /scratch/<net-id>/project;
python train.py --optim sgd;
SBATCH arguments
- `job-name`: name of the job
- `output`: console output will be written to this file
- `nodes`: how many nodes you need, usually 1 unless running in parallel
- `ntasks-per-node`: how many tasks per node, usually 1 unless running in parallel
- `cpus-per-task`: how many CPUs per task, usually 1
- `mem`: the memory size you need
- `time`: set a time limit, otherwise HPC will kill the job after 1 hour; HPC will kill the job once the time you set runs out
- `gres`: GPU usage; can be `gpu:1`, `gpu:2` (for parallel), `gpu:rtx8000:1`, `gpu:v100:2`, ...
- `mail-type` and `mail-user`: send an email when the job ends (or starts, if you set the mail-type to start)
- Submit the job (see more here)
$ sbatch script.sbatch
- other commands (see more here)
$ squeue -u <net-id> # to check the status of your jobs
$ scancel <job-id> # to cancel the job
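While the job runs, its console output accumulates in the file named by `--output` (`myjob.out` in the scripts above); you can inspect it with standard tools:

```shell
tail -n 20 myjob.out   # show the last 20 lines written so far
# tail -f myjob.out    # or follow the output live (press ctrl+c to stop)
```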
`srun` starts an interactive bash job. I usually use this to debug and test my programs, and then use `sbatch` to train the whole job.
$ srun --mem=8GB --time=2:00:00 --gres=gpu:rtx8000:1 --pty /bin/bash
After you run the command above, you will see that the hostname in your prompt has changed to something like `gr08` rather than `login-1`, and you can use the following command to check the GPU status and CUDA version.
$ watch -n 1 nvidia-smi
(you can press `ctrl+c` to quit the interactive view generated by the above command)
The command-line arguments are the same as the SBATCH arguments listed above.
OOD (Open OnDemand) provides GUI tools for HPC. Log into https://ood.hpc.nyu.edu and select Interactive Apps, where you can find many tools including Jupyter Notebook.
We can also easily use `venv` environments in OOD. All you need to do is activate the environment, install `ipykernel`, and register it. You can find tutorials online.
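As a sketch of that registration step (assuming the `myenv` environment created in the venv section; run this on Greene, and the display name is just a label of your choice):

```shell
source /scratch/<net-id>/myenv/bin/activate    # activate your venv
pip install ipykernel                          # install the kernel package
python -m ipykernel install --user \
    --name myenv --display-name "Python (myenv)"   # register it with Jupyter
```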
For `conda` environments, it's more complex: we need to use `singularity` to build the Jupyter kernel environment. The official link is here.