Using PyTorch on NYU HPC

A quick reference for accessing NYU's High Performance Computing Prince cluster.

The official wiki is here. This is an unofficial quick-start guide for first-time users, with a focus on Python and PyTorch.

Get an account

You need to be affiliated with NYU and have a sponsor.

To get an account approved, follow these steps.

Log in

Once you have been approved, you can access HPC from:

Within the NYU network (on campus):

ssh NYUNetID@prince.hpc.nyu.edu

Remember to replace NYUNetID with your own NetID.

Once logged in, you should land in your home directory, /home/NYUNetID, so running pwd should print:

[NYUNetID@log-0 ~]$ pwd
/home/NYUNetID

From an off-campus location:

First, log in to your VPN and then log in to the bastion host:

ssh NYUNetID@gw.hpc.nyu.edu

Then log in to the cluster:

ssh prince.hpc.nyu.edu

Using Windows

I use the MobaXterm ssh client with the following settings for the Prince Cluster:

Remote host: prince.hpc.nyu.edu
Username: NYUNetID
Port: 22

This makes it one click to open a terminal to Prince.

File Systems

You can get access to three file systems: /home, /scratch, and /archive.

/scratch is a file system mounted on Prince and connected to the compute nodes, where we can read and write files faster. Note that its contents are flushed every 60 days and are not backed up!

[NYUNetID@log-0 ~]$ cd /scratch/NYUNetID
[NYUNetID@log-0 ~]$ pwd
/scratch/NYUNetID

/home and /scratch are separate file systems in separate places. Choose the appropriate one depending on how often you use your files. I use /home for the files I won't touch often.
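Because /scratch is flushed periodically, it can help to see which files are at risk before they disappear. The sketch below lists files whose last access time is older than a given number of days. It assumes the flush is driven by how recently a file was accessed, which this guide does not confirm, so check the official wiki for the exact policy. The function name stale_files and the 50-day threshold are illustrative choices, not part of any HPC tooling.

```python
import os
import time
from pathlib import Path


def stale_files(root, max_age_days=60, now=None):
    """Return files under `root` whose last access is older than `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 3600
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                atime = path.stat().st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if atime < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Hypothetical location: replace NYUNetID with your own NetID.
    # 50 days is an arbitrary early-warning margin before the 60-day flush.
    for path in stale_files("/scratch/NYUNetID", max_age_days=50):
        print(path)
```

Run it from a login node and copy anything you still need to /home or /archive.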

Loading Modules

The module command (environment modules, separate from the Slurm scheduler) allows you to load and manage multiple versions and configurations of software packages.

To see available package environments:

module avail

To load a module:

module load [package name]

For example, if you want to use TensorFlow with GPU support:

module load cudnn/8.0v6.0
module load cuda/8.0.44
module load tensorflow/python3.6/1.3.0

To check what is currently loaded:

module list

To remove all packages:

module purge

To get helpful information about a package:

module show torch/gnu/20170504

This will print something like:

--------------------------------------------------------------------------------------------------------------------------------------------------
   /share/apps/modulefiles/torch/gnu/20170504.lua:
--------------------------------------------------------------------------------------------------------------------------------------------------
whatis("Torch: a scientific computing framework with wide support for machine learning algorithms that puts GPUs first")
whatis("Name: torch version: 20170504 compilers: gnu")
load("cmake/intel/3.7.1")
load("cuda/8.0.44")
load("cudnn/8.0v5.1")
load("magma/intel/2.2.0")
...

The load(...) lines are the dependencies that are also loaded when you load the package.
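If you want to record which dependencies a module pulls in (for example, to note them alongside an experiment), the load(...) lines can be extracted from saved module show output. This is a minimal sketch that works on text you have already captured. The helper name module_dependencies is hypothetical, and the sample is copied from the output above.

```python
import re

# Matches lines such as: load("cuda/8.0.44")
LOAD_PATTERN = re.compile(r'^\s*load\("([^"]+)"\)', re.MULTILINE)


def module_dependencies(module_show_output):
    """Return the module names listed in load("...") lines, in order."""
    return LOAD_PATTERN.findall(module_show_output)


sample = '''whatis("Name: torch version: 20170504 compilers: gnu")
load("cmake/intel/3.7.1")
load("cuda/8.0.44")
load("cudnn/8.0v5.1")
load("magma/intel/2.2.0")
'''

print(module_dependencies(sample))
# ['cmake/intel/3.7.1', 'cuda/8.0.44', 'cudnn/8.0v5.1', 'magma/intel/2.2.0']
```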

Interactive Mode: Request CPU

You can submit batch jobs on Prince to schedule work, which requires writing custom bash scripts. Batch jobs are great for longer runs. You can also run in interactive mode, which is great for short jobs and troubleshooting.

To run in interactive mode:

[NYUNetID@log-0 ~]$ srun --pty /bin/bash

This will run the default mode: a single CPU core and 2GB memory for 1 hour.

To request more CPUs:

[NYUNetID@log-0 ~]$ srun -n4 -t2:00:00 --mem=4000 --pty /bin/bash
[NYUNetID@c26-16 ~]$

That will request 4 tasks (one CPU core each by default) for 2 hours with 4000 MB (about 4 GB) of memory.

To end the interactive session and release the resources:

[NYUNetID@c26-16 ~]$ exit
[NYUNetID@log-0 ~]$

Interactive Mode: Request GPU

[NYUNetID@log-0 ~]$ srun --gres=gpu:1 --pty /bin/bash
[NYUNetID@gpu-25 ~]$ nvidia-smi
Mon Oct 23 17:49:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:12:00.0     Off |                    0 |
| N/A   37C    P8    29W / 149W |      0MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Submit a job

You can write a script that will be executed when the resources you requested become available.

A simple CPU demo:

#!/bin/bash

## 1) Job settings (the #!/bin/bash line must be the first line of the script)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=5:00:00
#SBATCH --mem=2GB
#SBATCH --job-name=CPUDemo
#SBATCH --mail-type=END
#SBATCH --mail-user=itp@nyu.edu
#SBATCH --output=slurm_%j.out
  
## 2) Everything from here on is going to run:

cd /scratch/NYUNetID/demos
python demo.py

Request GPU:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --time=10:00:00
#SBATCH --mem=3GB
#SBATCH --job-name=GPUDemo
#SBATCH --mail-type=END
#SBATCH --mail-user=itp@nyu.edu
#SBATCH --output=slurm_%j.out

cd /scratch/NYUNetID/trainSomething
source activate ML
python train.py

Submit your job with:

sbatch myscript.s

Monitor the job:

squeue -u $USER

More info here
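When you launch many similar jobs, writing each script by hand gets tedious. The following is a minimal sketch of generating and submitting a batch script from Python. The helper names (build_sbatch_script, parse_job_id, submit) and default values are hypothetical. The directives mirror the demos above, and the job ID parsing relies on sbatch printing a line of the form "Submitted batch job <id>".

```python
import re
import subprocess


def build_sbatch_script(commands, job_name="demo", time_limit="1:00:00",
                        mem="2GB", gpus=0):
    """Return the text of a batch script; the shebang is always the first line."""
    lines = [
        "#!/bin/bash",
        "#SBATCH --nodes=1",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --mem={mem}",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --output=slurm_%j.out",
    ]
    if gpus > 0:
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines.append("")  # blank line between the settings and the commands
    lines.extend(commands)
    return "\n".join(lines) + "\n"


def parse_job_id(sbatch_output):
    """Extract the job ID from sbatch output, or None if it is not found."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return int(match.group(1)) if match else None


def submit(script_path):
    """Submit a script with sbatch (only works on a machine with Slurm)."""
    result = subprocess.run(["sbatch", script_path],
                            capture_output=True, text=True, check=True)
    return parse_job_id(result.stdout)


script = build_sbatch_script(
    ["cd /scratch/NYUNetID/trainSomething", "python train.py"],
    job_name="GPUDemo", time_limit="10:00:00", mem="3GB", gpus=1,
)
print(script)
```

Write the returned text to a file such as myscript.s, then call submit("myscript.s") on a login node to get the job ID back for use with squeue.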

Transfer Files

I transfer files using MobaXterm. If you need to set up a tunnel, look here.

PyTorch

Once you are all set up with the above, getting PyTorch takes two steps:

Create a virtual environment
Load the appropriate modules in the environment

Creating a Virtual Environment

mkdir -p /scratch/NYUNetID/tmp/pytorch-gpu
cd /scratch/NYUNetID/tmp/pytorch-gpu
module load python3/intel/3.6.3
virtualenv --system-site-packages py3.6.3
source py3.6.3/bin/activate

After the above, your virtual environment is set up. Now you need to install PyTorch.

Installing PyTorch

Note on 5/12/20: On Prince, the GPU driver does not support CUDA 10.2. If you are running PyTorch, please use a PyTorch build for CUDA 10.1.

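# Default build (may target a CUDA version the Prince driver does not support; see the note above):
pip3 install torch torchvision
# Build for CUDA 10.1, as the note above recommends:
pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html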

Running short PyTorch scripts

Now, every time you want to use your PyTorch environment, all you need to do is:

[NYUNetID@log-0 ~]$ source py3.6.3/bin/activate   # activate the Python environment
[NYUNetID@log-0 ~]$ srun --gres=gpu:1 --pty /bin/bash   # start an interactive GPU session on HPC

[NYUNetID@gpu-25 ~]$ cd /scratch/NYUNetID/trainSomething
[NYUNetID@gpu-25 ~]$ python train.py
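Before starting a long training run, it is worth confirming that the environment actually sees PyTorch and the GPU. The sketch below is a small sanity check you could run on the GPU node. The function name describe_environment is hypothetical, and the script degrades gracefully if torch is not installed in the active environment.

```python
import importlib.util


def describe_environment():
    """Report whether PyTorch is importable and whether it can see a GPU."""
    info = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": False,
        "device": "cpu",
    }
    if info["torch_installed"]:
        import torch  # imported lazily so the check also runs without PyTorch

        info["torch_version"] = torch.__version__
        info["cuda_available"] = torch.cuda.is_available()
        if info["cuda_available"]:
            info["device"] = "cuda"
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If cuda_available is False on a GPU node, a likely cause is a PyTorch build that targets a CUDA version the driver does not support.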

Running Jupyter Notebook

Instructions are here

Once you have copied the Jupyter sbatch script (run-jupyter-gpu.sbatch below), submit it and inspect its output:

[NYUNetID@log-0 ~]$ source py3.6.3/bin/activate   # activate the Python environment
[NYUNetID@log-0 ~]$ sbatch run-jupyter-gpu.sbatch
[NYUNetID@log-0 ~]$ cat slurm-xxxx.out

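In a separate window (Ubuntu shell) on your own machine, type the following, replacing NNNN with the port number shown in the slurm output:

ssh -L NNNN:localhost:NNNN NYUNetID@prince.hpc.nyu.edu

Open a browser at localhost:NNNN using the URL from the slurm output, for example: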

http://localhost:8925/?token=76f100825af441457502d5d080c1776b987a2f76101460f4
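The port in the tunnel command has to match the port in the Jupyter URL, which is easy to get wrong when copying by hand. As a rough sketch, the helper below derives the ssh -L command from the URL printed in the slurm output. The function names and the default net_id and host values are assumptions taken from the login section of this guide.

```python
from urllib.parse import parse_qs, urlparse


def tunnel_command(jupyter_url, net_id="NYUNetID", host="prince.hpc.nyu.edu"):
    """Build the ssh port-forwarding command that matches a Jupyter URL."""
    port = urlparse(jupyter_url).port
    if port is None:
        raise ValueError(f"no port found in URL: {jupyter_url}")
    return f"ssh -L {port}:localhost:{port} {net_id}@{host}"


def jupyter_token(jupyter_url):
    """Return the token query parameter of a Jupyter URL, or None."""
    return parse_qs(urlparse(jupyter_url).query).get("token", [None])[0]


url = "http://localhost:8925/?token=76f100825af441457502d5d080c1776b987a2f76101460f4"
print(tunnel_command(url))
# ssh -L 8925:localhost:8925 NYUNetID@prince.hpc.nyu.edu
```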
