Overfitting issue

Issue Description
I tried training the network on a private dataset, starting by overfitting on a single image. I noticed that many optimizer steps were skipped because of invalid gradients. As a result, the network did not really converge even after 500 epochs. Once I added this block

        # Replace missing or NaN gradients with zeros so the optimizer step is not skipped.
        for p in self.model.parameters():
            if p.grad is None:
                p.grad = torch.zeros_like(p)
            else:
                is_nan = torch.isnan(p.grad)
                p.grad[is_nan] = 0

right after self.scaler.scale(loss).backward(), it worked better. But I guess there must be a better way than this.
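Before zeroing gradients, it can help to see which parameters actually receive invalid values. Note that GradScaler skips the optimizer step when it finds Inf as well as NaN gradients, while the block above only handles NaN. The following is a minimal diagnostic sketch, assuming a plain PyTorch model; `nonfinite_grad_report` is a hypothetical helper and not part of the repository, and the tiny `nn.Linear` model only stands in for the real network.

```python
import torch
from torch import nn


def nonfinite_grad_report(model: nn.Module) -> dict:
    """Map parameter name -> number of NaN/Inf entries in its gradient."""
    report = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue  # parameter did not take part in the backward pass
        bad = int((~torch.isfinite(p.grad)).sum().item())
        if bad:
            report[name] = bad
    return report


# Toy stand-in for the real model: force a NaN loss to trigger invalid gradients.
model = nn.Linear(3, 2)
x = torch.ones(4, 3)
loss = model(x).sum() * float("nan")
loss.backward()

report = nonfinite_grad_report(model)
print(report)  # {'weight': 6, 'bias': 2}
```

In a real training loop this would be called right after `backward()`. If the report is non-empty, `torch.autograd.set_detect_anomaly(True)` can help locate the operation that produced the first invalid value, at the cost of slower training.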

Good day!

Thank you for reporting the issue and sharing your workaround. I believe this is caused by a bug in the loss calculation: I forgot to detach the predicted tensors before the BoxMatcher uses them to find the corresponding bounding boxes. This occurs in

YOLO/yolo/tools/loss_functions.py

Line 91 in 868c821

    
           align_targets, valid_masks = self.matcher(targets, (predicts_cls.detach(), predicts_box.detach()))
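To illustrate what detaching changes, here is a minimal sketch. It assumes nothing about the real BoxMatcher: `toy_matcher` is a hypothetical stand-in that derives weights from the predictions. With `detach()`, the matching result is treated as a constant, so no gradient flows back through the matching step.

```python
import torch

pred = torch.tensor([0.2, 0.9, 0.5], requires_grad=True)


def toy_matcher(scores: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for a matcher: weights derived from the scores.
    return torch.softmax(scores, dim=0)


# Without detach, the matcher output stays in the autograd graph.
weights_attached = toy_matcher(pred)
# With detach, the matcher output is a constant as far as autograd is concerned.
weights_detached = toy_matcher(pred.detach())

print(weights_attached.requires_grad)  # True
print(weights_detached.requires_grad)  # False

# The loss still depends on pred directly, but not through the matcher.
loss = (weights_detached * pred).sum()
loss.backward()

# The gradient is exactly the constant weights, with no matcher contribution.
print(torch.allclose(pred.grad, weights_detached))  # True
```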

I have fixed these bugs in commit 4775b4c, but I'm not entirely sure if everything is resolved. I tried training the model on a small dataset, and it seems to be working correctly now. However, some data augmentations are still under development.

I strongly recommend training with the original YOLOv9 repository to avoid wasting GPU resources. I will release version 1.0 once most of the code is complete.

Best regards,
Henry Tsui

All right! Thanks! I know these things are hard to predict, but do you have a rough time frame in mind when v1 might be ready?