niuchuangnn/SPICE

Issue in replicating results on CIFAR10

Closed this issue · 21 comments

I am trying to train the model and replicate the results for the CIFAR10 dataset. I downloaded the dataset as per the instructions and followed the training steps in the attached file -
trainingStepsCIFAR10.md. It contains the commands and configs used (where they differ from the defaults in the repo) along with output snippets.

MoCo trains fine, but I think the cluster heads are not learning, as accuracy does not increase beyond 26%. I also tried a lower learning rate of 0.0005, but it didn't help. Finding locally consistent examples fails to find any consistent examples at ratio_confident=0.99 (the provided default) and finds only 6770 at ratio_confident=0.9.
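For context, here is my rough understanding of what the local-consistency selection does (a sketch only, not the actual code in spice/model/heads/sem_head_multi.py; the neighbor count and agreement threshold below are placeholder values):

import torch

def select_reliable(feas_sim, scores, ratio_confident=0.99, num_neighbor=100, agree_ratio=0.95):
    # feas_sim: (N, D) L2-normalized embeddings; scores: (N, C) cluster probabilities
    labels = scores.argmax(dim=1)                          # predicted cluster per sample
    idx_conf = scores.max(dim=1).values > ratio_confident  # keep only confident predictions

    # cosine similarity between all samples (full N x N here; the real code may chunk this)
    sim = feas_sim @ feas_sim.t()
    _, nn_idx = sim.topk(num_neighbor, dim=1)              # nearest neighbors per sample
    agree = (labels[nn_idx] == labels.unsqueeze(1)).float().mean(dim=1)
    idx_true = agree > agree_ratio                         # neighbors mostly share the label

    keep = idx_true & idx_conf                             # cf. idx_true * idx_conf in the repo
    return torch.where(keep)[0], labels[keep]

If that reading is right, the selection failing is probably a symptom of the cluster heads not learning rather than a separate bug.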

Is this normal? If not, could you give me some direction on where the issue might be and how to resolve it?

Detailed logs of cluster-head training are in the file:
spice-self-logs.txt
There are a few broken-pipe errors; I think those come from running training in Jupyter and reconnecting to the Jupyter server, so they shouldn't affect the convergence of the model.

It is better to start with the training tutorial for the STL10 dataset, which I have double-checked on this public repo. Make sure the MoCo model is well trained and used to initialize the feature model before training the clustering heads; you can also initialize the feature model with the counterpart of our released "*-self" models. I will double-check the CIFAR10 experiment and let you know my results.

I have reproduced our results. For MoCo training, the hyperparameter moco-k should be set to a smaller value (12800 in our experiments), because the total number of images in CIFAR10 is only 60000 (or 50000 for the training split alone), so the default value of 65536 will not work. The original log file for head training is attached here.
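In other words, the negative-queue size must stay below the number of training images (and, as in the original MoCo code, remain divisible by the batch size). A minimal example of the override, assuming default values for the other train_moco.py flags:

python tools/train_moco.py --moco-k=12800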

Hi @ML-Guy, thank you for your training steps. I followed them verbatim and everything worked fine until local_consistency.py. Did you manage to get it to run?
I get the following error:

Traceback (most recent call last):                                                             
  File "tools/local_consistency.py", line 222, in <module>                                     
    main()                                                                                     
  File "tools/local_consistency.py", line 101, in main
    main_worker(args.gpu, ngpus_per_node, cfg)
  File "tools/local_consistency.py", line 208, in main_worker
    idx_select, labels_select = model(feas_sim=feas_sim, scores=scores, forward_type="local_consistency")
  File "/home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gonzales/euroffice-clustering/euroffice_clustering/clustering/models/SPICE/./spice/model/sim2sem.py", line 40, in forward
    return self.head.local_consistency(feas_sim, scores)
  File "/home/gonzales/euroffice-clustering/euroffice_clustering/clustering/models/SPICE/./spice/model/heads/sem_head_multi.py", line 41, in local_consistency
    idx_true = idx_true * idx_conf
RuntimeError: The size of tensor a (50000) must match the size of tensor b (60000) at non-singleton dimension 0

This is the output log right before the error.

@niuchuangnn do you have any idea of what might be causing it?

My eval.py config file looks like this (everything else is the same):

model_name = "eval"
weight = './results/cifar10/spice_self/checkpoint_best.pth.tar'
model_type = "clusterresnet"

Thank you in advance

You need to set all=False in the config files for both the train and test data.
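In case it helps others hitting the same 50000 vs 60000 mismatch: as I understand it, the "all" flag controls whether the CIFAR10 dataset returns train+test (60000 images) or only the training split (50000), so the embedding and the cluster scores end up with different lengths when the two data entries disagree. A rough sketch of the kind of change meant (the surrounding keys are placeholders; keep whatever your config already has):

# sketch only: in both the train-data and test-data entries of the CIFAR10 configs
data_train = dict(
    type="cifar10",            # placeholder keys; keep the ones already in your config
    root="./datasets/cifar10",
    all=False,                 # 50000 training images only, instead of train + test (60000)
)
data_test = dict(
    type="cifar10",
    root="./datasets/cifar10",
    all=False,                 # must match, otherwise feas_sim (50000) and scores (60000) differ
)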

On CIFAR10, should I change the architecture in train_moco.py from clusterresnet to resnet18_cifar? The accuracy is very low when I use clusterresnet.

The accuracy of train_spice_v2.py is very low in the first several epochs:
Epoch: [1][ 2/500] Time 40.027 (39.098) Data 11.692 (11.372) Loss_0 2.1618e+00 (2.1787e+00) Loss_1 2.1596e+00 (2.1747e+00) Loss_2 2.2034e+00 (2.1826e+00) Loss_3 2.1306e+00 (2.1782e+00) Loss_4 2.1407e+00 (2.1796e+00) Loss_5 2.1169e+00 (2.1802e+00) Loss_6 2.1898e+00 (2.1781e+00) Loss_7 2.1425e+00 (2.1755e+00) Loss_8 2.1089e+00 (2.1796e+00) Loss_9 2.1751e+00 (2.1743e+00) lr 0.005000 (0.005000)
2022-07-19 07:37:12,981 spice.trainer INFO: None
INFO:spice.trainer:None
Epoch: [1][ 3/500] Time 38.129 (38.856) Data 11.648 (11.320) Loss_0 2.1372e+00 (2.1774e+00) Loss_1 2.1617e+00 (2.1741e+00) Loss_2 2.1962e+00 (2.1816e+00) Loss_3 2.1372e+00 (2.1761e+00) Loss_4 2.1570e+00 (2.1787e+00) Loss_5 2.1555e+00 (2.1797e+00) Loss_6 2.1160e+00 (2.1763e+00) Loss_7 2.1317e+00 (2.1749e+00) Loss_8 2.1468e+00 (2.1787e+00) Loss_9 2.1844e+00 (2.1728e+00) lr 0.005000 (0.005000)
2022-07-19 07:37:51,111 spice.trainer INFO: None
INFO:spice.trainer:None
Epoch: [1][ 4/500] Time 38.075 (38.700) Data 11.640 (11.289) Loss_0 2.1048e+00 (2.1777e+00) Loss_1 2.1332e+00 (2.1741e+00) Loss_2 2.1730e+00 (2.1815e+00) Loss_3 2.1412e+00 (2.1762e+00) Loss_4 2.1982e+00 (2.1787e+00) Loss_5 2.1537e+00 (2.1800e+00) Loss_6 2.1517e+00 (2.1767e+00) Loss_7 2.1767e+00 (2.1757e+00) Loss_8 2.1568e+00 (2.1794e+00) Loss_9 2.1363e+00 (2.1736e+00) lr 0.005000 (0.005000)
2022-07-19 07:38:29,187 spice.trainer INFO: None
INFO:spice.trainer:None
Epoch: [1][ 5/500] Time 38.098 (38.599) Data 11.636 (11.268) Loss_0 2.1055e+00 (2.1776e+00) Loss_1 2.1481e+00 (2.1730e+00) Loss_2 2.1377e+00 (2.1805e+00) Loss_3 2.0997e+00 (2.1756e+00) Loss_4 2.1591e+00 (2.1784e+00) Loss_5 2.1587e+00 (2.1783e+00) Loss_6 2.1398e+00 (2.1762e+00) Loss_7 2.1612e+00 (2.1748e+00) Loss_8 2.1285e+00 (2.1796e+00) Loss_9 2.1878e+00 (2.1743e+00) lr 0.005000 (0.005000)
2022-07-19 07:39:07,285 spice.trainer INFO: None
INFO:spice.trainer:None
Epoch: [1][ 6/500] Time 38.103 (38.528) Data 11.695 (11.259) Loss_0 2.0877e+00 (2.1766e+00) Loss_1 2.1730e+00 (2.1721e+00) Loss_2 2.1257e+00 (2.1786e+00) Loss_3 2.1406e+00 (2.1745e+00) Loss_4 2.1732e+00 (2.1779e+00) Loss_5 2.1054e+00 (2.1778e+00) Loss_6 2.1482e+00 (2.1751e+00) Loss_7 2.0977e+00 (2.1737e+00) Loss_8 2.0675e+00 (2.1784e+00) Loss_9 2.1664e+00 (2.1734e+00) lr 0.005000 (0.005000)
2022-07-19 07:39:45,388 spice.trainer INFO: None
INFO:spice.trainer:None
Epoch: [1][ 7/500] Time 38.083 (38.473) Data 11.645 (11.250) Loss_0 2.1543e+00 (2.1760e+00) Loss_1 2.1649e+00 (2.1726e+00) Loss_2 2.1584e+00 (2.1786e+00) Loss_3 2.1290e+00 (2.1743e+00) Loss_4 2.1437e+00 (2.1781e+00) Loss_5 2.1485e+00 (2.1771e+00) Loss_6 2.1543e+00 (2.1753e+00) Loss_7 2.1441e+00 (2.1734e+00) Loss_8 2.1349e+00 (2.1784e+00) Loss_9 2.1913e+00 (2.1733e+00) lr 0.005000 (0.005000)
2022-07-19 07:40:23,472 spice.trainer INFO: None
INFO:spice.trainer:None
Epoch: [1][ 8/500] Time 38.183 (38.441) Data 11.620 (11.239) Loss_0 2.1825e+00 (2.1760e+00) Loss_1 2.1715e+00 (2.1732e+00) Loss_2 2.1875e+00 (2.1788e+00) Loss_3 2.1749e+00 (2.1738e+00) Loss_4 2.2054e+00 (2.1780e+00) Loss_5 2.2236e+00 (2.1772e+00) Loss_6 2.1488e+00 (2.1752e+00) Loss_7 2.1792e+00 (2.1740e+00) Loss_8 2.2036e+00 (2.1780e+00) Loss_9 2.1946e+00 (2.1732e+00) lr 0.005000 (0.005000)
2022-07-19 07:41:01,656 spice.trainer INFO: None
INFO:spice.trainer:None
Epoch: [1][ 9/500] Time 38.274 (38.424) Data 11.713 (11.237) Loss_0 2.1429e+00 (2.1758e+00) Loss_1 2.1891e+00 (2.1737e+00) Loss_2 2.1617e+00 (2.1788e+00) Loss_3 2.1420e+00 (2.1742e+00) Loss_4 2.1224e+00 (2.1780e+00) Loss_5 2.1207e+00 (2.1773e+00) Loss_6 2.1280e+00 (2.1753e+00) Loss_7 2.0942e+00 (2.1743e+00) Loss_8 2.1652e+00 (2.1783e+00) Loss_9 2.1275e+00 (2.1732e+00) lr 0.005000 (0.005000)
2022-07-19 07:41:39,930 spice.trainer INFO: None
INFO:spice.trainer:None
2022-07-19 07:41:59,217 spice INFO: Real: ACC: 0.23276, NMI: 0.10334617888204631, ARI: 0.05324862727107204, head: 0
INFO:spice:Real: ACC: 0.23276, NMI: 0.10334617888204631, ARI: 0.05324862727107204, head: 0
2022-07-19 07:41:59,217 spice INFO: Loss: ACC: 0.22924, NMI: 0.10504999152299856, ARI: 0.055460127404697385, head: 6
INFO:spice:Loss: ACC: 0.22924, NMI: 0.10504999152299856, ARI: 0.055460127404697385, head: 6
2022-07-19 07:42:00,268 spice INFO: FINAL -- Best ACC: 0.23276, Best NMI: 0.10334617888204631, Best ARI: 0.05324862727107204, epoch: 1, head: 0
INFO:spice:FINAL -- Best ACC: 0.23276, Best NMI: 0.10334617888204631, Best ARI: 0.05324862727107204, epoch: 1, head: 0
2022-07-19 07:42:00,268 spice INFO: FINAL -- Select ACC: 0.22924, Select NMI: 0.10504999152299856, Select ARI: 0.055460127404697385, epoch: 1, head: 6
INFO:spice:FINAL -- Select ACC: 0.22924, Select NMI: 0.10504999152299856, Select ARI: 0.055460127404697385, epoch: 1, head: 6.

But the result provided on Google Drive is higher than 60%.
The final epoch log information is:

Epoch: [999][410/468] Time 0.096 ( 0.104) Data 0.000 ( 0.036) Loss 5.0032e+00 (4.9887e+00) Acc@1 90.62 ( 86.45) Acc@5 93.75 ( 94.63)
Epoch: [999][420/468] Time 0.103 ( 0.104) Data 0.000 ( 0.035) Loss 5.0059e+00 (4.9895e+00) Acc@1 81.25 ( 86.42) Acc@5 96.88 ( 94.65)
Epoch: [999][430/468] Time 0.105 ( 0.104) Data 0.000 ( 0.034) Loss 5.2582e+00 (4.9899e+00) Acc@1 81.25 ( 86.41) Acc@5 93.75 ( 94.67)
Epoch: [999][440/468] Time 0.094 ( 0.104) Data 0.000 ( 0.034) Loss 5.0616e+00 (4.9902e+00) Acc@1 84.38 ( 86.42) Acc@5 93.75 ( 94.65)
Epoch: [999][450/468] Time 0.104 ( 0.104) Data 0.000 ( 0.033) Loss 4.9001e+00 (4.9901e+00) Acc@1 93.75 ( 86.45) Acc@5 96.88 ( 94.66)
Epoch: [999][460/468] Time 0.104 ( 0.104) Data 0.000 ( 0.032) Loss 5.0555e+00 (4.9899e+00) Acc@1 90.62 ( 86.50) Acc@5 90.62 ( 94.66)

I think the MoCo model is well trained, and it has been used to initialize the feature model. When I trained MoCo, moco-k was set to 12800.

@YexiongLin I have not been able to replicate it yet; I am also working on it. @niuchuangnn if you can provide the MoCo training config, that would be great. I have tried the following:

I tested with the provided pre-trained MoCo backbone. Training accuracy starts from 73 and converges to 83.4 (log pretrained.txt). This means the spice_self_v2.py script works.

I tested with my own MoCo pretraining.
First I tried clusterresnet with image size 96 (the default):

  • model_type = "clusterresnet", moco-k 12800, 560 epochs: accuracy stuck at 22-23%
    log 560.txt

  • model_type = "clusterresnet", moco-k 12800, 1313 epochs: accuracy stuck at 22-23%
    log 1313.txt

  • model_type = "clusterresnet", moco-k 12800, 1000 epochs: accuracy stuck at 22-23%

  • model_type = "clusterresnet", moco-k 1536, 1000 epochs: accuracy stuck at 25-27%

Now I am trying resnet18_cifar with image size 32 (as per the paper):
moco_cifarresnet.txt

python tools/train_moco.py --moco-k=12800 --epochs=1000 --batch-size=512 --img_size=32 --arch=resnet18_cifar

@niuchuangnn can you provide the MoCo training config file or any necessary tricks?

@ML-Guy @YexiongLin Thanks for your interest in this work. I am now busy with other projects. Anyway, I will share a tutorial on the CIFAR-10 dataset within the next 7 days.

@ML-Guy @YexiongLin Please try the new version!

A tutorial for CIFAR10 is added here, please let me know your results. Thanks!

Thank you for your tutorial, the model can achieve an accuracy of 0.83778.

In the 4th step of the tutorial, the command is
python tools/local_consistency.py --config-file ./configs/stl10/eval.py --embedding ./results/cifar10/embedding/feas_moco_512_l2.npy
Should it be "--config-file ./configs/cifar10/eval.py"?

Do you mean the final result or the result from the second stage is 0.83778?

In the 4th step of the tutorial, the command is python tools/local_consistency.py --config-file ./configs/stl10/eval.py --embedding ./results/cifar10/embedding/feas_moco_512_l2.npy. Should it be "--config-file ./configs/cifar10/eval.py"?

Thanks! Sometimes there are some mistakes during copying and pasting from multiple old versions.
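For anyone following along, the 4th-step command with the corrected config path would be:

python tools/local_consistency.py --config-file ./configs/cifar10/eval.py --embedding ./results/cifar10/embedding/feas_moco_512_l2.npy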

Do you mean the final result or the result from the second stage is 0.83778?

the second stage.

OK, it is slightly lower than the reported result. Maybe the hyperparameters for training MoCo are slightly different, or there is variance between different trials on CIFAR-10. I will release my pretrained MoCo model and test the second training stage again.

The best accuracy is 0.84028; it is still lower than the reported result.

Got it. A 0.005 difference should be normal variance. Please try the third training stage and let me know the result. Testing on STL10 should have a very small variance.

@YexiongLin I have reproduced our results, sometimes even better than the reported results. You could try to start with our pretrained MoCo model here.

Thank you. I trained SPICE-Semi using one RTX 3090, but the best accuracy is 0.8994.

[2022-08-09 14:27:02,480 INFO] model saved: ./results/cifar10/spice_semi/semi/model_1047000.pth
INFO:semi:model saved: ./results/cifar10/spice_semi/semi/model_1047000.pth
[2022-08-09 14:37:40,576 INFO] 1048000 iteration, USE_EMA: True, {'train/sup_loss': tensor(8.4844e-05, device='cuda:0'), 'train/unsup_loss': tensor(0.0382, device='cuda:0'), 'train/total_loss': tensor(0.0382, device='cuda:0'), 'train/mask_ratio': tensor(0.0022, device='cuda:0'), 'lr': 0.005874884396511934, 'train/prefecth_time': 0.00026953598856925966, 'train/run_time': 0.6186185302734375, 'eval/loss': -1, 'eval/top-1-acc': 0.8958, 'eval/nmi': 0.8265365069171046, 'eval/ari': 0.7981030884105477}, BEST_EVAL_ACC: 0.8994, at 540000 iters
INFO:semi:1048000 iteration, USE_EMA: True, {'train/sup_loss': tensor(8.4844e-05, device='cuda:0'), 'train/unsup_loss': tensor(0.0382, device='cuda:0'), 'train/total_loss': tensor(0.0382, device='cuda:0'), 'train/mask_ratio': tensor(0.0022, device='cuda:0'), 'lr': 0.005874884396511934, 'train/prefecth_time': 0.00026953598856925966, 'train/run_time': 0.6186185302734375, 'eval/loss': -1, 'eval/top-1-acc': 0.8958, 'eval/nmi': 0.8265365069171046, 'eval/ari': 0.7981030884105477}, BEST_EVAL_ACC: 0.8994, at 540000 iters
[2022-08-09 14:37:41,393 INFO] model saved: ./results/cifar10/spice_semi/semi/model_last.pth
INFO:semi:model saved: ./results/cifar10/spice_semi/semi/model_last.pth
[2022-08-09 14:37:41,591 INFO] model saved: ./results/cifar10/spice_semi/semi/model_1048000.pth
INFO:semi:model saved: ./results/cifar10/spice_semi/semi/model_1048000.pth
[2022-08-09 14:44:07,721 INFO] model saved: ./results/cifar10/spice_semi/semi/latest_model.pth
INFO:semi:model saved: ./results/cifar10/spice_semi/semi/latest_model.pth
WARNING:root:GPU 0 training is FINISHED

Training based on the pretrained MoCo model may be better.

@YexiongLin Thanks for posting your results using one GPU. I hope you can reproduce our results with our pretrained MoCo model using four GPUs.