run time and outputs for main.py + visualization of results
ncoudray opened this issue · 7 comments
Hi -
Thanks a lot for your work and for sharing the code!
I have been trying to make it run and have a few questions:
1- I managed to run "eval_voc_classif.py" to completion, but I have some issues running "main.py" with this command (submitted via a batch script on a Slurm cluster):
export NGPU=4; python -m torch.distributed.launch --nproc_per_node=$NGPU main.py --dump_path ./output_4/ --data_path './YFCC100M/src/data/' --size_dataset 100000000 --workers 10 --sobel true --lr 0.1 --wd 0.00001 --nepochs 100 --batch_size 48 --reassignment 3 --dim_pca 4096 --super_classes 1 --rotnet false --k 320000 --warm_restart false --use_faiss true --niter 10 --world-size 4 --dist-url 'file:///gpfs/data/url/test4'
While the job still seems to be running, the log file ends with:
INFO - 06/28/20 04:18:17 - 0:00:02 - pretrained weights not found
INFO - 06/28/20 04:18:17 - 0:00:02 - model created
INFO - 06/28/20 04:18:17 - 0:00:02 - pretrained weights not found
No output seems to have been generated since it was launched more than 24 hours ago:
$ ls -lth output_4/*
10K Jun 28 04:18 output_4/train.log
988 Jun 28 04:18 output_4/params.pkl
output_4/cache:
total 0
output_4/checkpoints:
total 0
Is it indeed slow, or did I miss something?
2- Regarding the visualization of the results:
a- I don't see the code to generate Figure 5 (nine random images per cluster). Is it somewhere, or could you give some guidance on the best way to use and analyze the different outputs to obtain a similar result?
b- What is the best way to visually inspect the clusters generated by the algorithm? For example, some sort of UMAP projection where we could see clusters of points, with the "true" label of each point represented as a color; something along the lines of the sketch below.
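Just as a rough illustration of what I mean (placeholder random features, using the umap-learn and matplotlib packages, not anything from this repo):

```python
import numpy as np
import matplotlib.pyplot as plt
import umap  # pip install umap-learn

# Placeholder data: in practice, load per-image features from the trained
# model and the corresponding "true" labels from the metadata.
features = np.random.rand(500, 256)
true_labels = np.random.randint(0, 10, size=500)

# Project features to 2-D and color each point by its label.
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=true_labels, cmap="tab10", s=5)
plt.colorbar(label="true label")
plt.title("UMAP of features colored by metadata label")
plt.show()
```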
Thanks again,
Best,
Nicolas
Hi Nicolas,
Thanks for your interest in this work.
- Reading your log file, it seems that your run is stuck somewhere there, because "model to cuda" has not been printed. I recommend adding a print at every line to see what's causing the problem.
- I haven't included the code for Figure 5. For analyzing the clusters with respect to the metadata, I recommend logging the cluster assignment of each image in your dataset. Then you can compare these cluster assignments with the partition given by the metadata (with NMI for example, as in Figure 4); see the sketch below.
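For instance, an illustrative sketch using scikit-learn on placeholder arrays (not code from this repo):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Toy example: one cluster id and one metadata label per image, in the
# same order. Replace these with the assignments logged during training
# and the partition derived from your metadata.
cluster_assignments = np.array([0, 0, 1, 1, 2, 2, 2, 3])
metadata_labels = np.array([0, 0, 1, 1, 1, 2, 2, 2])

nmi = normalized_mutual_info_score(metadata_labels, cluster_assignments)
print(f"NMI between clustering and metadata partition: {nmi:.3f}")
```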
Hope that helps.
Thanks for the tips, I'll have a look.
Thanks!
Hi -
Regarding Q1, I've found the instruction that gets stuck:
net = DDP(net, delay_allreduce=True)
It has just been stuck at this line for days. Would you by any chance have an idea why this happens? Have you experienced any issues with that "parallel" library on some installations? (The environment was set up in Anaconda and used on a Slurm cluster, with 1 to 4 GPUs in my tests.)
Thanks,
Best,
Nicolas
Hello @ncoudray, I wonder if you have completed the visualization of the results? If you have finished it, could you share the code with me?
Thank you !
Hi Alice - I haven't gotten that far yet; I'm actually still stuck on the issue above and haven't gotten back to it.
Best
N.
Hi @ncoudray,
It can be that the processes are not initialized correctly. A way to debug this would be to print rank and world_size just before the distributed init method, here:
Line 68 in d38ada1
If you run with 4 GPUs, you should have 4 processes running, say P0, P1, P2, and P3. Then P0 should print 0, P1 should print 1, and so on. They should all print 4 for the world size.
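For example, a minimal standalone check along these lines (just a sketch, not code from this repo; it assumes you launch it with the same torch.distributed.launch command, which sets RANK and WORLD_SIZE in each process's environment, and it reuses the file:// URL from your command):

```python
import os
import torch.distributed as dist

# torch.distributed.launch sets RANK and WORLD_SIZE for each spawned process.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
print(f"rank={rank}, world_size={world_size}", flush=True)

# Same file:// init method as in your command. Note: a stale init file
# left over from a previous failed run may block initialization, so
# delete it between runs.
dist.init_process_group(
    backend="nccl",
    init_method="file:///gpfs/data/url/test4",
    world_size=world_size,
    rank=rank,
)
print(f"rank {rank}: process group initialized", flush=True)
```

With 4 GPUs you should see four distinct ranks (0 to 3), all reporting world size 4, followed by the "initialized" line from each process.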
Also, please reopen the issue if you want further assistance. If the issue is closed I do not get notified, and therefore I forget to reply.
Thanks - I've had to put this on hold for now, but I will get back to testing it in a few weeks and follow up then. Thanks for the feedback!