About the training logger and best_checkpoint.pth
Hi, authors. I am impressed by your great work, and I am now trying to run your code with the same configuration (4 GPUs, the same parameters, and the same training dataset). During the first epoch the loss gradually decreases from 18.3 to 17.5, but in the second epoch I get a sharp rise in the training loss (from 18 to 200+), and the loss gradually climbs to 7000+ by the fourth epoch. In addition, the top-5 error is always above 99%. To debug this, I am now generating more logs to investigate; maybe it is caused by an invalid gradient. If possible, could you please share your training_logger and val_logger results with me? I would also appreciate it if you could share "best_checkpoint.pth". Thank you!
What I have changed:
"""
In configdataset.py, GLDv2_build_train_dataset(csv_path, clean_csv_path, image_dir, output_directory, True, 0.2, 0): the fifth argument changed from False to True.
Also, the val_logger plot is using the same data as the training logger.
"""
I have logged the process and there is a gradient explosion problem. In your code, gradient clipping is not applied, because max_norm is set to a negative value. Do you use the same configuration as in the bash file? I am looking forward to your help. Thanks in advance.
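For reference, a minimal sketch of what enabling clipping could look like in a plain PyTorch step; model, optimizer, loss, and max_norm here are placeholder names, not the repository's actual variables:
"""
import torch

# Minimal sketch, not the repository's training loop: enable gradient clipping
# in a plain PyTorch step. A positive max_norm (e.g. 5.0) turns clipping on,
# which is what the negative value in the bash file effectively disables.
def training_step(model, optimizer, loss, max_norm=5.0):
    optimizer.zero_grad()
    loss.backward()
    if max_norm > 0:  # clip only when a positive threshold is given
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
"""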
Thanks for your interest in our work. I do occasionally encounter gradient explosion problems during training, and training returned to normal after I modified the random seed, so you can try modifying it. Of course, this could also be caused by mixed-precision training, and you will need to identify the problem yourself. Since our work is related to a company project, we cannot provide the optimal model parameters at this time. Please see the following pictures for the loss curve during training and the final test results.
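As an illustration of "modify the random seed", a minimal sketch of seeding the common sources of randomness in a PyTorch run; the function name and the exact set of calls are assumptions, not taken from this repository:
"""
import random

import numpy as np
import torch

# Minimal sketch with a placeholder function name: seed every common source of
# randomness so that changing the seed actually changes initialization and
# data shuffling for the whole run.
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # CPU RNG (and CUDA RNGs in recent PyTorch)
    torch.cuda.manual_seed_all(seed)  # explicit seeding of all GPUs
"""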
Thanks for your help! I will modify the random seed first and then check the mixed-precision training. By the way, did you get the results above using the same configuration (lr, batch_size, etc.) as the bash file in this repository? And no gradient clipping?
Yes, I use the parameters shown in the bash file, and gradient clipping is not used.
Thanks for your quick response. I have tried several new random seeds other than '11' (the value used in the bash file), but the gradient explosion occurs every time. About mixed-precision training: in the current repository you use the DistributedDataParallel module without mixed precision, right? Sorry, I am not familiar with distributed parallel training; I thought mixed-precision issues were mainly an Apex thing, maybe. I am now also trying to turn off parallel training and run on a single GPU.
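For what it's worth, native PyTorch mixed precision (torch.cuda.amp) can be combined with DistributedDataParallel without Apex. Below is a minimal sketch with placeholder names, not the repository's actual loop; as a side benefit, GradScaler.step skips updates whose gradients contain inf/NaN, which can help when chasing an explosion:
"""
import torch
from torch.cuda.amp import GradScaler, autocast

# Minimal sketch with placeholder names, not the repository's loop: native
# PyTorch mixed precision used together with a DistributedDataParallel model.
scaler = GradScaler()

def amp_step(ddp_model, optimizer, criterion, images, targets):
    optimizer.zero_grad()
    with autocast():                       # forward pass in mixed precision
        loss = criterion(ddp_model(images), targets)
    scaler.scale(loss).backward()          # backward on the scaled loss
    scaler.unscale_(optimizer)             # unscale before optional clipping
    torch.nn.utils.clip_grad_norm_(ddp_model.parameters(), 5.0)
    scaler.step(optimizer)                 # skipped if grads contain inf/NaN
    scaler.update()
    return loss.item()
"""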
By the way, I noticed some small typos in the code, as mentioned in my first post. In addition, the output log in your test result is different from the current version of this repository. I am wondering whether you could share the version of the code that generated the loss curve and test results above.
Just as a check, I have summarized the dataset configuration below; I hope it is the same as yours.
Total 1580470 images
Total 1280787 images for training
Total 81313 classes for training
Average 15.751318977285305 images per label for training
Total 299683 images for validation
Total 81313 classes for validation
Average 3.6855484362894004 images per label for validation
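For context, these counts are roughly what a per-class hold-out with fraction 0.2 and seed 0 would produce. A minimal sketch of such a split follows; the 'landmark_id' column name is assumed from the GLDv2 train.csv convention, and this is not the GLDv2 reference implementation:
"""
import pandas as pd

# Minimal sketch, not the GLDv2 reference implementation: hold out a fraction
# of each class for validation with a fixed seed so the split is reproducible.
# The real split may treat very small classes differently.
def split_per_class(csv_path, validation_fraction=0.2, seed=0):
    df = pd.read_csv(csv_path)
    val_index = (
        df.groupby("landmark_id", group_keys=False)
          .apply(lambda g: g.sample(frac=validation_fraction, random_state=seed))
          .index
    )
    return df.drop(index=val_index), df.loc[val_index]
"""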
Thank you very much for your time. I am grateful for your efforts and kind help. (Wishing you a happy Spring Festival in advance!)
For the first problem: I use the DistributedDataParallel module with mixed precision to train the network. Yesterday I tried retraining again and did not encounter the gradient explosion problem, so I don't know what is going wrong on your side. The bash file I used is shown below.
For the second problem: the code I use locally is not very different from the current version in the repository; I only removed some output information that was convenient for debugging when I ran the experiments before. I use the same training-set configuration as you, and this part of the code is adapted from the GLDv2 reference code.
Thanks for sharing. FYI, I re-downloaded this repository and only fixed the typos in the code. Using the same parameters as in your screenshot, the explosion occurs again. Anyway, I want to express my gratitude for your efforts and kind help once more. I will continue debugging to locate the problem. If you have time, you could also retrain using the code in this repository. Best wishes!