facebookresearch/DeeperCluster

Running the code on different scheduler/single GPU

meghbhalerao opened this issue · 6 comments

Hi all,

Thanks a lot for making the code public! The repository says that the code is currently adapted only for distributed training on the SLURM scheduler, right? I have 2 questions:

  1. Could you please point me to the place where I would have to make changes to adapt the code, if I am using a different compute cluster? (I am using an SGE (Sun Grid Engine) based HPC cluster)
  2. Can I run this repo if I do not have a compute cluster, i.e., simply on my local machine, which has 1 or 2 GPUs?

Thanks in advance!
Megh

Hi Megh,
Thanks for your interest in this work. Yes, you can launch the code locally by using the torch.distributed.launch utility. You can follow scenario 2 described here: https://github.com/facebookresearch/DeeperCluster#distributed-training

export NGPU=1; python -m torch.distributed.launch --nproc_per_node=$NGPU main.py --dist-url env:// [--arg1 --arg2 etc] 

I guess you'll have to comment out

init_signal_handler()
and probably adapt this part:
args.is_slurm_job = 'SLURM_JOB_ID' in os.environ and not args.debug_slurm
if args.is_slurm_job:
    args.rank = int(os.environ['SLURM_PROCID'])
else:
    # jobs started with torch.distributed.launch
    # read environment variables
    args.rank = int(os.environ['RANK'])
    args.world_size = int(os.environ['WORLD_SIZE'])
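
For illustration, here is a minimal sketch of what that initialization could look like for a purely local launch (the function name, the device selection and the process-group call below are illustrative assumptions, not the exact code in src/utils.py):

import os
import torch

def init_distributed_mode(args):
    # local launch with torch.distributed.launch: skip the SLURM branch entirely
    args.is_slurm_job = False

    # torch.distributed.launch exports RANK and WORLD_SIZE for every process it spawns
    args.rank = int(os.environ['RANK'])
    args.world_size = int(os.environ['WORLD_SIZE'])

    # on a single node, process i can simply use GPU i
    torch.cuda.set_device(args.rank % torch.cuda.device_count())

    # env:// picks up MASTER_ADDR / MASTER_PORT set by the launcher
    torch.distributed.init_process_group(
        backend='nccl',
        init_method=args.dist_url,
        world_size=args.world_size,
        rank=args.rank,
    )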

Hi @mathildecaron31,
Thanks a lot for the information, I will try what you mentioned and let you know!

Hi @mathildecaron31,
I am trying to run the code now, and I think I have almost gotten it to work. However, I am facing some issues. I have 2 GPUs on my node, and here is my main.sh:

mkdir -p ./exp/deepercluster/
export NGPU=1; python -m torch.distributed.launch --nproc_per_node=$NGPU main.py \
--dump_path ./exp/deepercluster/ \
--data_path ./data/clipart/train \
--size_dataset 20000 \
--workers 4 \
--sobel false \
--lr 0.1 \
--wd 0.00001 \
--nepochs 100 \
--batch_size 48 \
--reassignment 3 \
--dim_pca 4096 \
--super_classes 4 \
--rotnet true \
--k 1 \
--warm_restart false \
--use_faiss true \
--niter 10 \
--world-size 64 \
--dist-url env://

As can be seen, I am trying to train a RotNet (hence I need to set super_classes = 4, as mentioned in the README), but I keep getting this error:

assert args.world_size % args.super_classes == 0
AssertionError

When I print args.world_size I get 2 if I set NGPU=2 and 1 if NGPU=1.

From what I could understand:

  1. args.world_size = the number of GPUs available on my machine (according to the README).
  2. Why are we passing world_size as an argument to main.py? It does not seem to be used there: in utils.py, at
    https://github.com/facebookresearch/DeeperCluster/blob/master/src/utils.py#L67
    world_size is assigned independently of any argument passed to main.py.

Any ideas on how to tackle this?

Thanks again,
Megh

Hi @meghbhalerao
Sorry for the delay in my reply:

  1. Yes
  2. If you use scenario 2, you do not need to pass world_size as an argument since, as you correctly pointed out, it is set automatically in the code.

If you want to run on 2 GPUs you should use NGPU=2, not NGPU=1 as in the command you posted.

Unfortunately, you cannot run RotNet with only 2 GPUs with this implementation as it is: each process sees only one rotation, and since there are 4 rotation classes there must be at least 4 processes (hence the assertion that world_size is divisible by super_classes).
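
To illustrate the constraint, a simplified sketch (the rank-to-rotation mapping below is an illustrative assumption, not the actual code in the repo):

super_classes = 4   # four rotation classes: 0, 90, 180, 270 degrees

# the divisibility check that raises the AssertionError above
for world_size in (1, 2, 4, 8):
    print(world_size, 'ok' if world_size % super_classes == 0 else 'fails the assert')
# 1 and 2 fail, 4 and 8 pass

# assumed simplified mapping: each process trains on a single rotation class
print([rank % super_classes for rank in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]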

Hi @mathildecaron31,
No problem and thanks a lot for your help. I understand now. Closing this for now.