This is a setup guide for running RL models on the NYU HPC Prince cluster.
From your terminal:
ssh user_id@gw.hpc.nyu.edu
ssh prince.hpc.nyu.edu
A better way is to set up a one-off SSH tunnel. Follow these steps on your local computer:
mkdir ~/.ssh
chmod 700 ~/.ssh
Add the following lines to ~/.ssh/config:
Host hpcgwtunnel
HostName gw.hpc.nyu.edu
ForwardX11 no
LocalForward 8025 dumbo.hpc.nyu.edu:22
LocalForward 8026 prince.hpc.nyu.edu:22
User NetID
# next we create an alias for incoming packets on the port. The
# alias corresponds to where the tunnel forwards these packets
Host dumbo
HostName localhost
Port 8025
ForwardX11 yes
User NetID
Host prince
HostName localhost
Port 8026
ForwardX11 yes
User NetID
Now, to log in, just run:
ssh hpcgwtunnel
ssh -Y prince
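Keep the hpcgwtunnel session open (for example in a separate terminal) while you work; it is what holds the forwarded ports. The dumbo alias defined in the config works the same way:
ssh -Y dumbo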
There is no OpenAI Gym module on the cluster, so to run most RL models we need to install it ourselves. We can't install packages directly into the cluster environment, so we have to create a virtual environment where our code will run.
cd /scratch/$user_id$/
nano create_env.sh
Copy the following code into the file you just created:
# Remove any preloaded modules
module purge
# load the existing python3 module to get virtualenv
module load python3/intel/3.7.3
module load cuda/10.0.130
# create virtual environment with python3
virtualenv --system-site-packages $HOME/mlenv
# activate virtual environment
source $HOME/mlenv/bin/activate
# install torch and gym
pip install torch torchvision
pip install gym
pip install "gym[atari]"
In the script above we first load modules that are already installed on the cluster, such as python3 and cuda. You can also try loading anaconda3, but it didn't work for me. There is no need to load a cudnn module, as it is loaded automatically with python3. To find any other module, use module avail to see the list of available modules, or search with module keyword $keyword$. After activating the environment, install all the packages your project needs with pip.
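For example, to check whether something you need is already provided as a module before installing it yourself (the module names and versions below are only illustrations and may differ):
module avail                     # list all available modules
module keyword cuda              # search modules by keyword
module show python3/intel/3.7.3  # inspect what a module loads and sets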
Now create the environment by running:
source create_env.sh
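To quickly verify the environment was built correctly, you can check that the packages installed above import and report their versions:
python3 -c "import torch, gym; print(torch.__version__, gym.__version__)"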
You only need to create the environment once. After that, to start it you just load the modules and activate the environment with source $HOME/mlenv/bin/activate.
To make things easy, create a script for that:
nano run_env.sh
module purge
module load python3/intel/3.7.3
module load cuda/10.0.130
source $HOME/mlenv/bin/activate
To start the environment: source run_env.sh
To submit a job you need to create a batch script like the following:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --mem=16000
#SBATCH --job-name="ccm_project"
#SBATCH --time=12:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=$email_id$
#SBATCH --output=slurm_%j.out
module purge
module load python3/intel/3.7.3
module load cuda/10.0.130
# Replace with your NetID
NETID=###
# activate virtual environment
source /home/$NETID/mlenv/bin/activate
export PYTHONPATH=/scratch/$NETID/RL/Project/Updated/:$PYTHONPATH
cd /path/to/your/project  # the directory containing the file you want to run
python3 ./game.py
Make sure you adapt this script to your requirements: load all the modules you need and, if you want to run inside a virtual environment, activate it. Then cd into the directory and run the program. Setting PYTHONPATH is necessary if you are running modularized code; otherwise Python will not be able to import code from other files.
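As a sketch of why this matters (the agents package below is purely hypothetical), suppose game.py imports code from a sibling package in your project:
# Hypothetical layout under /scratch/$NETID/RL/Project/Updated/:
#   game.py              (contains: from agents.dqn import DQNAgent)
#   agents/__init__.py
#   agents/dqn.py
# Putting the project root on PYTHONPATH lets that import resolve
# no matter which directory the batch job starts in:
export PYTHONPATH=/scratch/$NETID/RL/Project/Updated/:$PYTHONPATH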
Instead of loading modules and activating the environment manually, you can also use the run_env.sh script you created above:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --mem=16000
#SBATCH --job-name="ccm_project"
#SBATCH --time=12:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=$email_id$
#SBATCH --output=slurm_%j.out
# Replace with your NetID
NETID=###
# activate virtual environment
source /home/$NETID/run_env.sh
export PYTHONPATH=/scratch/$NETID/RL/Project/Updated/:$PYTHONPATH
cd /path/to/your/project  # the directory containing the file you want to run
python3 ./game.py
To run a job:
sbatch $job-script$
To see already running jobs:
squeue -u $user_id$
To cancel jobs:
scancel $job_id$
Output of the job (such as logs, standard output, and standard error) will be created in the directory from which you submitted the job.
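Because the scripts above set --output=slurm_%j.out, each job writes its log to a file named after the job ID in the submission directory, which you can follow while the job runs (the job ID here is a placeholder):
tail -f slurm_1234567.out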
To copy files to or from the cluster, first log in and start the tunnel (ssh hpcgwtunnel), then use scp from your local machine:
scp source destination
Either the source or the destination can be on another (remote) host, by prefixing the path with "hostname:".
From local → remote
scp local_file prince:/scratch/net_id/
From remote → local
scp prince:/scratch/net_id/remote_file local_file
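To copy whole directories instead of single files, scp's -r flag works the same way (results/ is just a placeholder name):
scp -r results/ prince:/scratch/net_id/
scp -r prince:/scratch/net_id/results/ .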
The Fugu download link mentioned on the NYU HPC pages is not working, so I have attached the correct version in this repository.
After installing:
Step 1: Start Fugu. Select SSH → new SSH tunnel
- Create tunnel to: prince (or dumbo)
- Service or port: 22
- Local port: 8026 (8025 to dumbo)
- Tunnel host: gw.hpc.nyu.edu
- Username:
Step 2: In SFTP window
- Connect to: localhost
- Username:
- Port: 8026 (or 8025)
Step 3:
- Click connect and enter password.
- Drag and drop files to copy/paste to and from cluster.