Running the code on different scheduler/single GPU
meghbhalerao opened this issue · 6 comments
Hi all,
Thanks a lot for making the code public! The repository says that the code is currently adapted only for distributed training on the SLURM scheduler, right? I have 2 questions:
- Could you please point me to the place where I would have to make changes to adapt the code, if I am using a different compute cluster? (I am using an SGE (Sun Grid Engine) based HPC cluster)
- Can I run this repo if I do not have a compute cluster i.e., just simply on my local machine, which has 1/2 GPUs?
Thanks in advance!
Megh
Hi Megh,
Thanks for your interest in this work. Yes, you can launch the code locally using the torch.distributed.launch utility. You can follow scenario 2 described here: https://github.com/facebookresearch/DeeperCluster#distributed-training
export NGPU=1; python -m torch.distributed.launch --nproc_per_node=$NGPU main.py --dist-url env:// [--arg1 --arg2 etc]
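For context, torch.distributed.launch spawns one process per GPU and sets RANK, WORLD_SIZE, and LOCAL_RANK in each process's environment, which is why the `--dist-url env://` flag works without a manually specified world size. A minimal sketch of how a script can read these variables (the helper name `get_dist_env` is hypothetical, not from the repo):

```python
import os

def get_dist_env():
    """Read the environment variables that torch.distributed.launch
    sets for each spawned process (illustrative helper)."""
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    return rank, world_size, local_rank

# With NGPU=1 the launcher spawns a single process with RANK=0, WORLD_SIZE=1:
os.environ.update({"RANK": "0", "WORLD_SIZE": "1", "LOCAL_RANK": "0"})
print(get_dist_env())  # -> (0, 1, 0)
```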
Hi @mathildecaron31,
Thanks a lot for the information, I will try what you mentioned and let you know!
Hi @mathildecaron31,
I am trying to run the code now, and I think I have almost gotten it to work. However, I am facing some issues. I have 2 GPUs on my node, and here is my main.sh:
mkdir -p ./exp/deepercluster/
export NGPU=1; python -m torch.distributed.launch --nproc_per_node=$NGPU main.py \
--dump_path ./exp/deepercluster/ \
--data_path ./data/clipart/train \
--size_dataset 20000 \
--workers 4 \
--sobel false \
--lr 0.1 \
--wd 0.00001 \
--nepochs 100 \
--batch_size 48 \
--reassignment 3 \
--dim_pca 4096 \
--super_classes 4 \
--rotnet true \
--k 1 \
--warm_restart false \
--use_faiss true \
--niter 10 \
--world-size 64 \
--dist-url env://
As you can see, I am trying to train a RotNet (hence I need to set super_classes = 4, as mentioned in the README), but I keep getting this error:
assert args.world_size % args.super_classes == 0
AssertionError
When I print args.world_size, I get 2 if I set NGPU=2 and 1 if NGPU=1.
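The assertion can be reproduced in isolation: main.py requires the total number of processes to be divisible by the number of super-classes, which is 4 in RotNet mode. A minimal sketch of that check (the function name is hypothetical):

```python
def world_size_is_valid(world_size, super_classes=4):
    # Mirrors `assert args.world_size % args.super_classes == 0`:
    # each super-class (rotation class) must be handled by a whole
    # number of processes.
    return world_size % super_classes == 0

print(world_size_is_valid(2))  # False: the assertion fires with 2 GPUs
print(world_size_is_valid(4))  # True: 4 processes, one per rotation
```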
From what I could understand:
- args.world_size = number of GPUs available on my machine (according to the README)
- Why are we passing world_size as an argument to main.py? It does not seem to be used: in utils.py (https://github.com/facebookresearch/DeeperCluster/blob/master/src/utils.py#L67), world_size is assigned independently of any argument passed to main.py.
Any ideas how to tackle this?
Thanks again,
Megh
Hi @meghbhalerao
Sorry for the delayed reply:
- Yes, args.world_size is the number of GPUs (i.e., processes).
- If you use scenario 2, you do not need to pass world_size as an argument, since it is set automatically in the code, as you correctly pointed out.
If you want to run on 2 GPUs, you should use NGPU=2, not NGPU=1 as in the example you posted.
Unfortunately, you cannot run RotNet with only 2 GPUs with this implementation as it is. Each process sees only one rotation, and since there are 4 rotation classes, there must be at least 4 processes.
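To illustrate the constraint: with 4 rotation classes, each process is assigned a single rotation, so at least 4 processes are needed to cover them all. A hypothetical sketch of such a rank-to-rotation mapping (not the repo's actual code):

```python
def rotation_for_process(rank):
    """Assign one of the 4 rotation classes (0, 90, 180, 270 degrees)
    to a process based on its rank (illustrative sketch)."""
    return 90 * (rank % 4)

# With world_size=2, only two of the four rotations are ever seen:
print([rotation_for_process(r) for r in range(2)])  # [0, 90]
# With world_size=4, all rotation classes are covered:
print([rotation_for_process(r) for r in range(4)])  # [0, 90, 180, 270]
```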
Hi @mathildecaron31,
No problem and thanks a lot for your help. I understand now. Closing this for now.