hellozhuo/dgc

About replicating the experimental results

Closed this issue · 8 comments

Thank you for your contribution!
I ran into some problems reproducing the results with your open-source code. First, each checkpoint I save is 33M, which does not match the 17M file you provided. Second, when reproducing the CIFAR100 experiment I only reached about 67% accuracy on two RTX 2080Ti cards. On the other hand, I almost reproduced your CIFAR10 results. Could you give me some pointers, please?
[screenshot of training results]

@cvv-student Thanks for your interest. The 33M checkpoint includes both the model itself (17M) and the state dict of the optimizer used during training.
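For reference, here is a minimal sketch of the kind of checkpoint layout that produces this size difference; the key names and file names below are illustrative, not necessarily the exact ones used in this repo:

```python
# Sketch: a full training checkpoint stores the optimizer state dict
# (e.g. SGD momentum buffers) next to the model weights, roughly doubling
# the file size compared to the weights alone.
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pth"):
    torch.save({
        "epoch": epoch,
        "state_dict": model.state_dict(),     # the ~17M of model weights
        "optimizer": optimizer.state_dict(),  # momentum buffers etc., the extra size
    }, path)

def extract_weights(path="checkpoint.pth", out_path="weights_only.pth"):
    # Keeping only the model weights gives a file close to the released 17M one.
    ckpt = torch.load(path, map_location="cpu")
    torch.save(ckpt["state_dict"], out_path)
```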
For reproducibility, I re-ran the bash script provided in this repo for CIFAR100, using 1 RTX 2080Ti, and got 76.95% top-1 accuracy. Please see my training log here.
dydensenet_4336900_71109208.txt

@zhuogege1943 Thank you for your reply. I'm pleasantly surprised that you would re-run the experiment for me. I will try training again with exactly the same settings as yours. Could I refer to all of your code (such as the mobilenet models)? Your work is interesting, and I hope to reproduce your results and build some changes on top of it.

@cvv-student Sorry, we didn't keep the code for mobilenet~ it's pretty much the same as the implementations for densenet and resnet. Good luck with your research.

@zhuogege1943 Hi, after retrying the experiment, I succeeded in reproducing the result.
[screenshot of training results]
Judging from the loss curve, the result could be even better. I'm surprised that changing only the batch size and the number of GPUs leads to a 10% gap (2 RTX 2080Ti with batch size 256 gives 67%, while 1 RTX 2080Ti with batch size 64 gives 76%). Besides, I noticed that one epoch takes only about 3 minutes in your log, while mine takes 10 minutes. Do you have any tricks for speeding up training or for adjusting the batch size?

@cvv-student Interesting. I guess other settings also matter; for example, it is recommended to adjust the learning rate when the batch size is changed, see here (a rough sketch of this rule follows below). The training speed depends on GPU utilization; the highest speed is achieved when the GPU is 100% utilized during training, which you can check with nvidia-smi under the GPU-Util column. I didn't use any particular trick for speeding up and only used the code and the bash script provided.
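As an illustration of the linear scaling rule mentioned above, here is a small sketch; the base learning rate of 0.1 is an assumption for illustration, not necessarily the value in the provided bash script:

```python
# Sketch of the linear scaling rule: when the batch size grows by a factor k,
# scale the base learning rate by the same factor k.
base_lr = 0.1          # assumed LR tuned for batch size 64 (illustrative)
base_batch_size = 64

def scaled_lr(batch_size, base_lr=base_lr, base_batch_size=base_batch_size):
    """Scale the learning rate linearly with the batch size."""
    return base_lr * batch_size / base_batch_size

print(scaled_lr(64))    # 0.1 -> the 1-GPU, batch-size-64 setting
print(scaled_lr(256))   # 0.4 -> a candidate LR for the 2-GPU, batch-size-256 setting
```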

@zhuogege1943 Yeah, the higher the GPU utilization, the better. But in my runs, GPU utilization stays below 50% (1 RTX 2080Ti and batch size 64), so I am curious how you reach a speed of 3 minutes per epoch.

@cvv-student Probably because I store the ImageNet dataset on an SSD.
update: A utilization rate below 50% may mean that the dataloader processing on the CPU is slow, so the GPU has to wait for the CPU. The CPU in my machine is an Intel i7-8700 @ 3.2 GHz. You could also try increasing the number of workers by setting -j in the bash script (a sketch of what that controls is below).
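For illustration, here is a minimal sketch of what the worker count typically controls in a PyTorch training setup; the dataset path, batch size, and the assumption that -j maps to num_workers are illustrative, not the repo's exact configuration:

```python
# Sketch: more DataLoader worker processes keep the GPU fed so it doesn't
# wait on CPU-side preprocessing; pin_memory speeds up host-to-GPU copies.
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

train_set = torchvision.datasets.CIFAR100(
    root="./data", train=True, download=True,
    transform=transforms.ToTensor())

train_loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    num_workers=8,     # assumed to be what -j sets; raise it until GPU-Util stays high
    pin_memory=True)
```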

@zhuogege1943 Thank you for your patience in answering all my questions. There should be no more problems, haha. Good luck with your work.