facebookresearch/DeeperCluster

Running the code on different scheduler/single GPU

meghbhalerao opened this issue · 6 comments

Hi all,

Thanks a lot for making the code public! The repository says that the code is currently adapted only for distributed training on the SLURM scheduler, right? I have 2 questions:

  1. Could you please point me to the place where I would have to make changes to adapt the code, if I am using a different compute cluster? (I am using an SGE (Sun Grid Engine) based HPC cluster)
  2. Can I run this repo if I do not have a compute cluster, i.e., simply on my local machine, which has 1 or 2 GPUs?

Thanks in advance!
Megh

Hi Megh,
Thanks for your interest in this work. Yes, you can launch the code locally by using the torch.distributed.launch utility. You can follow scenario 2 described here: https://github.com/facebookresearch/DeeperCluster#distributed-training

export NGPU=1; python -m torch.distributed.launch --nproc_per_node=$NGPU main.py --dist-url env:// [--arg1 --arg2 etc] 

I guess you'll have to comment out

init_signal_handler()
and probably adapt this part:
args.is_slurm_job = 'SLURM_JOB_ID' in os.environ and not args.debug_slurm
if args.is_slurm_job:
    args.rank = int(os.environ['SLURM_PROCID'])
else:
    # jobs started with torch.distributed.launch
    # read environment variables
    args.rank = int(os.environ['RANK'])
    args.world_size = int(os.environ['WORLD_SIZE'])
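
For illustration, here is a minimal sketch of what that initialization could look like for a purely local launch (the function name, the device selection and the process-group call below are illustrative assumptions, not the exact code in src/utils.py):

import os
import torch

def init_distributed_mode(args):
    # local launch with torch.distributed.launch: skip the SLURM branch entirely
    args.is_slurm_job = False

    # torch.distributed.launch exports RANK and WORLD_SIZE for every process it spawns
    args.rank = int(os.environ['RANK'])
    args.world_size = int(os.environ['WORLD_SIZE'])

    # on a single node, process i can simply use GPU i
    torch.cuda.set_device(args.rank % torch.cuda.device_count())

    # env:// picks up MASTER_ADDR / MASTER_PORT set by the launcher
    torch.distributed.init_process_group(
        backend='nccl',
        init_method=args.dist_url,
        world_size=args.world_size,
        rank=args.rank,
    )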

Hi @mathildecaron31,
Thanks a lot for the information, I will try what you mentioned and let you know!

Hi @mathildecaron31,
I am trying to run the code now, and I think I have almost gotten it to work. However, I am facing some issues. I have 2 GPUs on my node, and here is my main.sh:

mkdir -p ./exp/deepercluster/
export NGPU=1; python -m torch.distributed.launch --nproc_per_node=$NGPU main.py \
--dump_path ./exp/deepercluster/ \
--data_path ./data/clipart/train \
--size_dataset 20000 \
--workers 4 \
--sobel false \
--lr 0.1 \
--wd 0.00001 \
--nepochs 100 \
--batch_size 48 \
--reassignment 3 \
--dim_pca 4096 \
--super_classes 4 \
--rotnet true \
--k 1 \
--warm_restart false \
--use_faiss true \
--niter 10 \
--world-size 64 \
--dist-url env://

As can be seen, I am trying to train a RotNet (hence I need to set super_classes = 4, as mentioned in the README), but I keep getting this error:

assert args.world_size % args.super_classes == 0
AssertionError

When I print args.world_size I get 2 if I set NGPU=2 and 1 if NGPU=1.

From what I could understand:

  1. args.world_size = the number of GPUs available on my machine (according to the README).
  2. Why are we passing world_size as an argument to main.py? It does not seem to be used there: in utils.py, at
    https://github.com/facebookresearch/DeeperCluster/blob/master/src/utils.py#L67
    world_size is assigned independently of any argument passed to main.py.

Any ideas on how to tackle this?

Thanks again,
Megh

Hi @meghbhalerao
Sorry for the delay in my reply:

  1. Yes
  2. If you use scenario 2, you do not need to pass world_size as an argument since, as you correctly pointed out, it is set automatically in the code.

If you want to run on 2 GPUs you should use NGPU=2, not NGPU=1 as in the command you posted.

Unfortunately, you cannot run RotNet with only 2 GPUs with this implementation as it is: each process sees only one rotation, and since there are 4 rotation classes there must be at least 4 processes (hence the assertion that world_size is divisible by super_classes).
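
To illustrate the constraint, a simplified sketch (the rank-to-rotation mapping below is an illustrative assumption, not the actual code in the repo):

super_classes = 4   # four rotation classes: 0, 90, 180, 270 degrees

# the divisibility check that raises the AssertionError above
for world_size in (1, 2, 4, 8):
    print(world_size, 'ok' if world_size % super_classes == 0 else 'fails the assert')
# 1 and 2 fail, 4 and 8 pass

# assumed simplified mapping: each process trains on a single rotation class
print([rank % super_classes for rank in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]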

Hi @mathildecaron31,
No problem and thanks a lot for your help. I understand now. Closing this for now.