Working on Greene can be confusing. One key concept to keep straight is that we refer to four different "workspaces": local, Greene login, Greene compute, and Greene compute container.
- Local is your computer.
- Greene login is the entry point into the Greene cluster, which you access by running `ssh greene` after properly configuring your ~/.ssh/config file (see Part 1). Use it to request compute nodes for running your code and to organize your scratch folder, which is shared across all nodes on the Greene cluster.
- Greene compute is the compute node you request from Greene login. This is where you will launch your Singularity instance from.
- Greene compute container is your Singularity instance that is running on Greene compute. This has your full conda environment with all your dependencies, and is what you will connect your VSCode instance to if you want to do remote development or run code remotely.
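As a rough map of how you move between these workspaces (each command is explained in the parts below):

```bash
# local           ->  Greene login:              ssh greene
# Greene login    ->  request a compute node:    make getnode                (run from NYU_HPC/)
# local           ->  Greene compute:            ssh greenecompute           (after adding the node ID to ~/.ssh/config)
# Greene compute  ->  start the container:       make sing
# local           ->  Greene compute container:  ssh greenecomputecontainer
```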
To connect to the Greene cluster, you must first make sure your SSH settings are configured properly on your local machine at ~/.ssh/config. If this file does not exist on your computer yet, run `touch ~/.ssh/config`. Once you have verified this file exists, follow the steps outlined below:
- Open your ~/.ssh/config file in your preferred terminal text editor (`vim ~/.ssh/config` and `nano ~/.ssh/config` are both good options).
- If your ~/.ssh/config file does not look like the one provided in this repository (see example_ssh_config_file), copy the contents of example_ssh_config_file into your ~/.ssh/config file, changing every NetID reference (where it says alc9635) to your own NetID.
- Save and close your ~/.ssh/config file.
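For orientation, the file you end up with should roughly follow the structure sketched below. This is only an illustration of the host aliases used in the rest of this guide; example_ssh_config_file in this repository is the authoritative version, and the ProxyJump line and exact contents of each section here are assumptions:

```
Host greene
    HostName greene.hpc.nyu.edu
    User [YOUR_NETID_HERE]

Host greenecompute greenecomputecontainer
    HostName [NODE_ID]        # filled in each session with the compute node you are assigned (see Part 3)
    User [YOUR_NETID_HERE]
    ProxyJump greene          # assumption: compute nodes are reached through the login node
    # greenecomputecontainer additionally carries a RemoteCommand that drops you into the
    # Singularity instance -- see the example file in this repository and Part 5
```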
- SSH into Greene: `ssh greene` (this will only work if your ~/.ssh/config file is set up properly; see Part 1 if you haven't done this yet).
- Go to your scratch folder: `cd /scratch/[YOUR_NETID_HERE]`
- If you don't have this repository in your scratch folder, git clone it into your scratch folder.
- If you already have the correct Dataset_Student/ folder in your project folder on Greene, you can skip this step. Otherwise, download it from the Deep Learning Google Drive, then run the following in a LOCAL Terminal (replacing [YOUR_NETID_HERE] with your NetID): `cd ~/Downloads`, then `scp Dataset_Student_V2.zip [YOUR_NETID_HERE]@greene.hpc.nyu.edu:/scratch/[YOUR_NETID_HERE]/video_prediction_project`. Then go back to the Terminal that's connected to Greene login and run `cd video_prediction_project` followed by `unzip Dataset_Student_V2.zip`. You can delete the .zip file after this is done.
- Similarly, make sure the hidden leaderboard set is in your project folder on Greene. If it isn't, download hidden_set_for_leaderboard_1.zip, then run the following in a LOCAL Terminal: `cd ~/Downloads`, then `scp hidden_set_for_leaderboard_1.zip [YOUR_NETID_HERE]@greene.hpc.nyu.edu:/scratch/[YOUR_NETID_HERE]/video_prediction_project`. Then go back to the Terminal that's connected to Greene login and run `cd video_prediction_project` followed by `unzip hidden_set_for_leaderboard_1.zip`. You can delete the .zip file after this is done (a quick sanity check of the resulting layout follows this list).
- Run `cd NYU_HPC`, then from that folder run `make build`. This tells bash to start building the overlays that will inject a conda environment with all our requirements into our Singularity instance. After you've done this once, you won't need to do it again (it takes a long time to run).
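At this point it's worth a quick check that everything landed where the later steps expect it. The Dataset_Student/ name comes from the dataset archive; the hidden-set folder name depends on its archive, so adjust accordingly:

```bash
# On Greene login:
cd /scratch/[YOUR_NETID_HERE]/video_prediction_project
ls    # should list Dataset_Student/, the unzipped hidden set, and NYU_HPC/
```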
After setting up Parts 1 and 2, you can skip straight to Part 3 every time you want to connect to a Greene compute node to do remote development / run model training or any other GPU-intensive tasks.
- SSH into Greene login: `ssh greene`
- Go to the NYU_HPC folder within your project folder: `cd /scratch/[YOUR_NETID_HERE]/video_prediction_project/NYU_HPC`
- Run `make getnode`. This will submit a job requesting a Greene compute node.
- Run `squeue -u $USER` to check the node ID of the Greene compute node you were provisioned, shown in the very last column, titled "NODELIST (REASON)". It might take a while for an actual ID to show up here, so just re-run this command every minute or so until you see something like "gv014" in that column.
- Copy the node ID, then open a different Terminal LOCALLY. In that local terminal, open your ~/.ssh/config file and paste the node ID into the section with the header "Host greenecompute greenecomputecontainer". For example, if your node ID is gv016, that section of your ~/.ssh/config file should look like this:
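A sketch of that section after pasting the node ID (most likely as the HostName; leave the rest of the section exactly as it appears in example_ssh_config_file):

```
Host greenecompute greenecomputecontainer
    HostName gv016
    # ...keep the remaining lines from example_ssh_config_file unchanged
```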
- Go back to your Terminal instance that's connected to Greene login and run `exit` to close the Greene login connection. Then run `ssh greenecompute` in that same terminal to connect to Greene compute.
- Run `cd /scratch/[YOUR_NETID_HERE]/video_prediction_project/NYU_HPC`.
- Run `make sing` to start your Singularity instance on your Greene compute node. You should see a message in your Terminal saying the Singularity instance was successfully launched.
- From here, you have two options: see Part 4 if you just want to run your code, and Part 5 if you want to do full remote development in VSCode.
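Putting Part 3 together, the whole flow from login node to running container looks roughly like this (aliases and paths as configured in Parts 1 and 2):

```bash
# On Greene login (ssh greene):
cd /scratch/[YOUR_NETID_HERE]/video_prediction_project/NYU_HPC
make getnode                 # submit the job requesting a compute node
squeue -u $USER              # repeat until NODELIST shows a node ID such as gv016

# Locally: paste that node ID into the "Host greenecompute greenecomputecontainer"
# section of ~/.ssh/config, then back in the Greene login terminal:
exit
ssh greenecompute            # connect to the compute node
cd /scratch/[YOUR_NETID_HERE]/video_prediction_project/NYU_HPC
make sing                    # launch the Singularity instance
```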
- From a local Terminal window, run `ssh greenecomputecontainer`. This should connect you to your running Singularity instance (your Terminal prompt should show "Singularity>" on the left-hand side), after which you can run `cd /scratch/[YOUR_NETID_HERE]/video_prediction_project` to cd into your project.
- Initialize your conda environment: `conda activate /ext3/conda/dlproj`
- Run whatever .py scripts you like.
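Before kicking off a long run, a quick sanity check that the container actually sees a GPU can save time. A minimal sketch, assuming the conda environment includes PyTorch (which the .pth checkpoints used later suggest):

```bash
# At the Singularity> prompt, with the conda environment activated:
nvidia-smi                                                     # should list the GPU on the compute node
python -c "import torch; print(torch.cuda.is_available())"     # should print True
```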
VERY IMPORTANT NOTE: If you are doing remote development on the Greene compute node that you requested and set up in Part 3, note that the node will only stay active for about 3-4 hours at most before it gets shut down by HPC. So SAVE OFTEN to avoid accidentally losing your work.
- In VSCode, go to the "Remote Explorer" tab. VSCode should already have populated this with all the locations you specified in your ~/.ssh/config file.
Note: This may require some additional VSCode configuration to get working. For starters, open Settings in VSCode, search for "RemoteCommand", and make sure that option is enabled:
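The option in question is most likely the Remote-SSH extension's "Enable Remote Command" setting; in settings.json that corresponds to the following (an assumption, so verify the exact key in your VSCode version):

```json
{
    "remote.SSH.enableRemoteCommand": true
}
```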
- Go to the "Extensions" tab in VSCode and download extensions for Python and Jupyter (should be the first options when you search either).
- In the top menu bar of VSCode, go to Terminal > New Terminal, then run `conda activate /ext3/conda/dlproj` to initialize your conda environment.
- You can now click the "Open Folder" button on the left hand navigation of VSCode and enter /scratch/[YOUR_NETID_HERE]/video_prediction_project/ to get to your project folder, and you can now directly edit the code on your Singularity instance. Note that when editing Jupyter notebooks, you'll need to select the kernel that corresponds to your conda environment name, which should work fine after you've downloaded the Python and Jupyter extensions for VSCode in your remote instance.
In your Singularity instance, do the following:
- Make sure all required packages are installed
  - Run `pip install -r requirements.txt`
- Use the config.py file to set parameters for the pretrain and finetune models.
  - If you are pretraining for the first time or reconfiguring the pretraining parameters, set `pretrain` to `True` in the config.py file and change the parameters accordingly. Remember that the output dimension of the pretrain model must match the input dimension of the finetune model.
  - If you are using a previously pretrained model, set `pretrain` to `False` and make sure `model_id` is the name of the pretrained model you want to use.
    - Note: a previously pretrained model is saved in your current directory as `VICReg_pretrained_{time}.pth`
- Pretrain and/or finetune the model
  - Run `python model_pipeline.py` in your Singularity instance. Once it's finished running, the new pretrained and finetuned models will be saved in your current directory.
  - The best finetuned model is selected from the epoch with the best validation loss and saved in your current directory as `video_predictor_finetuned_best_val_{time}.pth`
- Predict masks for the hidden dataset
  - Update the file names of those two models (pretrained and finetuned) in `submission.ipynb` (run `ls -lah` to find the file names of the most recent models).
  - Run all cells in `submission.ipynb`.
  - The predicted masks for the hidden dataset will be saved as a tensor of size (2000, 160, 240) in `submitted_tensor_team12.pt`.
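Before submitting, it can be worth confirming that the saved tensor has the expected shape; a minimal check using PyTorch (which the pipeline already uses):

```bash
python -c "import torch; print(torch.load('submitted_tensor_team12.pt').shape)"
# expected: torch.Size([2000, 160, 240])
```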