qinenergy/cotta

[resolved, user is now able to reproduce results for ImageNet] ImageNet experiments

Closed this issue · 15 comments

Hi, thanks for releasing the code.
When I reproduce the Table 6 ImageNet-C results from your paper (severity level 5), my source error is 82.4 and BN adapt is 72.1, both matching your paper.
However, my Tent result is 69.2 and my CoTTA result is about 71.6, which is inconsistent with your paper, where Tent is 66.5 and CoTTA is 63.
Could you explain why the results differ?
Thanks

Hi, same here, and thanks for releasing the code! I was also wondering why my results are different on CIFAR-10C. Without resetting, the accuracy of both CoTTA and Tent dropped rapidly, so I want to know whether something was wrong in my reproduction process. The only thing I did differently was to replace the Standard model with Modas2021PRIMEResNet18 from RobustBench.
Thanks for any help!

Hi,
I have fixed the problem. I noticed that the resetting operation is important; when Tent and CoTTA are run continually, the results now seem fine. But I have another question: the hyperparameters for ImageNet-C are the same as Tent's, so why is the reproduced Tent result worse than the one reported in the Tent paper?

OK, thanks for your reply! So that is to say, "resetting" is necessary to achieve good performance for both Tent and CoTTA? I had assumed that CoTTA could achieve its reported results without any additional domain information (i.e., without being told when to reset).

Actually, in Table 4, Tent-online is much better than Tent-continual, while on ImageNet-C it is the opposite, so it depends.

Thanks for your interest in our work.

@HaoYang-219 May I ask about your setup for the ImageNet results? Did you use the same environment and our test script, which evaluates all 10 corruption sequences? Internally we tested this on several different servers and the results were reproducible. The CoTTA result should be around 63 for ImageNet.

Can you maybe share your logs?

Author here. Thanks for your interest in our work.
@kadmkbl
Resetting: Resetting is not used in our code because we consider the continual adaptation case.
CIFAR10 results: We only tested the Standard model from RobustBench, using the same hyperparameters (learning rate, etc.) as TENT. Using a different architecture as you did may require adjusting the learning rate. For CoTTA, if you find the model forgetting previous knowledge, you can also make the restoration factor (RST) larger.
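
To illustrate what a larger RST does, here is a minimal sketch of CoTTA-style stochastic restoration (a simplification for illustration, not the repo's exact code; `stochastic_restore` and `source_state` are illustrative names):

```python
import torch

def stochastic_restore(model, source_state, rst=0.01):
    # With probability `rst`, restore each weight element to its pretrained
    # source value. A larger `rst` preserves more source knowledge and
    # reduces forgetting, at the cost of slower adaptation.
    for name, param in model.named_parameters():
        if param.requires_grad:
            # Element-wise Bernoulli mask: 1 = restore, 0 = keep adapted value.
            mask = (torch.rand_like(param) < rst).float()
            with torch.no_grad():
                param.copy_(source_state[name] * mask + param * (1.0 - mask))

# `source_state` would be a copy of the pretrained weights captured before
# adaptation, e.g. {k: v.clone() for k, v in model.state_dict().items()}.
```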

Please let me know if you have difficulty reproducing the CIFAR10 results using the same setup as the repo.

Thanks a lot, I will change it to the Standard architecture. I will let you know if it works out or if other problems occur.

Hi, I have fixed the problem. I noticed that the resetting operation is important; when Tent and CoTTA are run continually, the results now seem fine. But I have another question: the hyperparameters for ImageNet-C are the same as Tent's, so why is the reproduced Tent result worse than the one reported in the Tent paper?

@HaoYang-219
It seems that you are now able to reproduce our ImageNet results for CoTTA?

TENT results: We use the Standard_R50 model from RobustBench for our ImageNet experiments. TENT seems to use something different (the TENT code for ImageNet is not available), as our baseline result does not match theirs. In any case, the setups of the two papers are very different: we consider the continual adaptation case, whereas TENT resets whenever it meets a new domain. Therefore, the results are not directly comparable, especially for ImageNet (because the domain sequence is different).
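
For concreteness, the two protocols differ roughly as follows (a simplified sketch; `adapt_and_eval` and `reset_to_source` are hypothetical helpers standing in for the actual adaptation and evaluation code):

```python
def eval_continual(model, domains, adapt_and_eval):
    # Continual setup (ours): adapt across the whole corruption sequence
    # without ever resetting, so the adapted state carries over between domains.
    return [adapt_and_eval(model, d) for d in domains]

def eval_episodic(model, domains, adapt_and_eval, reset_to_source):
    # Episodic setup (TENT's): reset the model to its source weights
    # before adapting to each new domain.
    errors = []
    for d in domains:
        reset_to_source(model)
        errors.append(adapt_and_eval(model, d))
    return errors
```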

Yes, thanks for your reply!

In case of future questions on ImageNet, you can refer to the following CoTTA logs for Table 6. We evaluated the code in this repo multiple times on several servers. Results should be around 63.

cotta_logs.tar.gz

A summary of the discussion:

  1. Our ImageNet setup considers continual ImageNet adaptation. Reset is never used, the severity level is 5, and the model is Standard_R50 from RobustBench.
  2. You can use eval.py to quickly evaluate the mean error over the 10 corruption sequences; the reported result is the mean over these 10 sequences (see the sketch after this list).
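
The summary statistics (mean and standard deviation over the 10 per-sequence mean errors, as in the output posted later in this thread) can be computed along these lines. This is a sketch of the computation, not the repo's eval.py itself:

```python
import statistics

def summarize(per_sequence_errors):
    # One mean error (%) per corruption sequence, e.g. parsed from the
    # 10 log files; returns the overall mean and population std.
    mean = sum(per_sequence_errors) / len(per_sequence_errors)
    std = statistics.pstdev(per_sequence_errors)
    return mean, std
```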

@qinenergy Hi, I am unable to reproduce the results on ImageNet, although I could reproduce the results on CIFAR-10 and CIFAR-100. Any suggestions?

My final output is:

read cotta files:
read 10 files.
[79.21333333333332, 83.21866666666668, 79.78, 81.048, 93.33200000000001, 78.82000000000001, 96.352, 84.70266666666666, 77.24, 78.17733333333332]
(83.1884, 6.245348136902469)

@hasibzunair Hi, thanks for your interest in our work.
Could you send your log files to my email? I can take a look.
May I ask whether you used the same architecture and environment as listed in the repository? You can also compare your logs to ours above to see if there is any discrepancy in the setup.

@qinenergy Yup, I ran the code as is; I just changed the batch size, which I also did for CIFAR-100. I'm running on an NVIDIA 3080 Ti GPU with CUDA 11. Attached are the logs in a folder:

output.zip

Hi, I think you changed the batch size too aggressively.
Changing it from 64 to 16 would definitely require re-tuning the hyperparameters, at least the learning rate.

Please use a batch size of 64, as in our experiments, to reproduce the results. Thanks.
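
If a smaller batch size is unavoidable, one common starting point for re-tuning is the linear scaling rule (a general heuristic, not something validated for this repo; the base learning rate in the example is purely illustrative):

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    # Linear scaling rule: scale the learning rate proportionally to the
    # batch size so the expected update magnitude stays roughly constant.
    return base_lr * new_batch_size / base_batch_size

# Example: moving from the repo's batch size of 64 down to 16,
# scaled_lr(1e-2, 64, 16) -> 2.5e-03  (1e-2 is an assumed base LR)
```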

Thanks for the suggestion! Will do that.