Overfitting issue
johannes-tum opened this issue · 2 comments
Issue Description
I tried to train the network on another, private dataset. I started by overfitting on a single image and noticed that a lot of optimizer steps were skipped because of invalid gradients. As a consequence, the network did not really converge even after 500 epochs. Once I added this block
for p in self.model.parameters():
    if p.grad is None:
        # Give parameters without a gradient an all-zero gradient.
        p.grad = torch.zeros_like(p)
    else:
        # Zero out NaN entries so the optimizer step is not skipped.
        is_nan = torch.isnan(p.grad)
        p.grad[is_nan] = torch.zeros_like(p.grad[is_nan])
after self.scaler.scale(loss).backward(), it worked better. But I guess there must be a better way to handle this.
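For context, here is a rough sketch of where that masking sits in the AMP training step (simplified; the exact attribute names of my trainer may differ, and torch.nan_to_num_ is just a compact in-place version of the loop above):

import torch

def training_step(self, loss):
    self.scaler.scale(loss).backward()
    # Mask NaN gradient entries in place so GradScaler does not skip
    # the whole optimizer step because of a few invalid values.
    for p in self.model.parameters():
        if p.grad is not None:
            torch.nan_to_num_(p.grad, nan=0.0)
    self.scaler.step(self.optimizer)
    self.scaler.update()
    self.optimizer.zero_grad()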
Good day!
Thank you for reporting the issue and providing your solution. I believe this situation is caused by a bug in the loss calculation: I forgot to detach the predicted tensor when the BoxMatcher finds the corresponding bounding box. This occurs in
YOLO/yolo/tools/loss_functions.py, line 91 at commit 868c821.
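Roughly, the idea of the fix looks like this (a sketch with placeholder names, not the exact code in that file):

def compute_loss(preds, targets, matcher, loss_fn):
    # Sketch with placeholder callables (matcher, loss_fn), not the exact
    # code in loss_functions.py. The matcher only decides which ground
    # truth each predicted box is responsible for, so it should receive
    # detached tensors and contribute no gradients of its own.
    assignment = matcher(preds.detach(), targets)
    # The loss still uses the attached predictions, so gradients flow
    # back through the loss terms rather than through the matching.
    return loss_fn(preds, targets, assignment)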
I have fixed these bugs in commit 4775b4c, but I'm not entirely sure if everything is resolved. I tried training the model on a small dataset, and it seems to be working correctly now. However, some data augmentations are still under development.
I strongly recommend training via the original YOLOv9 repo to avoid wasting GPU resources. I will release version 1.0 once most of the code is complete.
Best regards,
Henry Tsui
All right, thanks! I know these things are hard to predict, but do you have a rough time frame in mind for when v1 might be ready?