Loss computation sometimes causes NaN values

Question

tobymuller233 opened this issue a month ago · 3 comments

tobymuller233 commented a month ago

Search before asking

I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

While fine-tuning my model for several epochs after pruning, I found that the loss occasionally becomes NaN. By setting breakpoints and inspecting the values, I traced the problem to what looks like a bug in metrics.py.
If a predicted bounding box has a width or height of 0, the loss comes out as NaN, because the CIoU computation uses h1 and h2 as divisors.
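
To illustrate the failure mode, here is a minimal pure-Python sketch of the aspect-ratio term from the CIoU definition, v = (4/pi^2) * (atan(w2/h2) - atan(w1/h1))^2. It is not the code from metrics.py: `ieee_div` and `aspect_term` are hypothetical helpers written for this example, and `ieee_div` only mimics how float tensors divide by zero (returning inf or NaN, where plain Python would raise).

```python
import math


def ieee_div(a: float, b: float) -> float:
    """Divide the way IEEE 754 floats do, instead of raising.

    Hypothetical helper for illustration; the sign of a zero divisor is
    ignored because box widths and heights are assumed non-negative.
    """
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan                    # 0/0 is undefined -> NaN
    return math.copysign(math.inf, a)      # x/0 -> +inf or -inf


def aspect_term(w1: float, h1: float, w2: float, h2: float) -> float:
    """CIoU aspect-ratio term v with no guard on the divisors h1 and h2."""
    diff = math.atan(ieee_div(w2, h2)) - math.atan(ieee_div(w1, h1))
    return (4 / math.pi ** 2) * diff ** 2


v_ok = aspect_term(2.0, 4.0, 3.0, 3.0)     # well-formed boxes
v_flat = aspect_term(2.0, 0.0, 3.0, 3.0)   # zero height only: w/h = inf
v_empty = aspect_term(0.0, 0.0, 3.0, 3.0)  # zero width and height: 0/0

print(math.isnan(v_ok), math.isnan(v_flat), math.isnan(v_empty))
# False False True
```

In this sketch only the 0/0 case gives a NaN forward value. A zero height alone yields inf, which atan maps to pi/2. Gradients through such a division may still be non-finite, so a zero height on its own is not necessarily safe during training.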

Environment

No response

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

Yes I'd like to help by submitting a PR!

Answer 1 · 2024-11-15T08:43:45.000Z

👋 Hello @tobymuller233, thank you for your interest in YOLOv5 🚀! It looks like you're encountering NaN values during training, possibly caused by a bug in the metrics.py file. To assist, we'll need a bit more information.

If this is a 🐛 Bug Report, please provide a minimal reproducible example to help us understand and debug the issue. This would include steps to replicate the bug, relevant sections of your code, and any specific error messages.

Additionally, it would be helpful to know more about your environment setup, such as the version of Python, PyTorch, and any other dependencies you are using.

If you have any further insights, like dataset characteristics or specific conditions that might trigger this issue, do share those as well.

Please note that this is an automated response, and an Ultralytics engineer will review your issue and provide further assistance soon. Thank you for your patience and help in improving YOLOv5! 🚀✨

Answer 2 · 2024-11-16T01:01:58.000Z

@tobymuller233 thank you for reporting this potential issue with loss computation. You've identified an important edge case where predictions with zero width or height could cause NaN values during CIoU loss calculation.

Before proceeding with a PR, please verify this behavior using the latest version of YOLOv5 as there have been several loss computation improvements. If you can provide a minimal reproducible example (MRE) following our MRE guide, it would help us investigate the issue more effectively.

For now, you could add a small epsilon value to prevent division by zero in the height calculations. However, we should also investigate why the model is predicting zero-sized bounding boxes during training, as this may indicate other underlying issues with the training process or data.
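
As a rough illustration of the epsilon workaround, the sketch below guards the two height divisors in the CIoU aspect-ratio term, v = (4/pi^2) * (atan(w2/h2) - atan(w1/h1))^2. It is written in plain Python rather than taken from metrics.py. The function name `aspect_term_guarded` and the value `EPS = 1e-7` are assumptions made for this example.

```python
import math

EPS = 1e-7  # assumed epsilon, chosen for illustration only


def aspect_term_guarded(w1: float, h1: float, w2: float, h2: float,
                        eps: float = EPS) -> float:
    """CIoU aspect-ratio term v with eps added to each height divisor.

    For non-negative heights, h + eps is strictly positive, so neither
    division can hit 0/0 or x/0.
    """
    diff = math.atan(w2 / (h2 + eps)) - math.atan(w1 / (h1 + eps))
    return (4 / math.pi ** 2) * diff ** 2


v_empty = aspect_term_guarded(0.0, 0.0, 3.0, 3.0)  # zero width and height
v_flat = aspect_term_guarded(2.0, 0.0, 3.0, 3.0)   # zero height only

print(math.isfinite(v_empty), math.isfinite(v_flat))
# True True
```

The guard only hides the symptom: a degenerate box still produces a large ratio such as 2 / 1e-7, so it remains worth finding out why zero-sized predictions appear in the first place.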

If you'd like to submit a PR, please ensure it includes:

The MRE demonstrating the issue
Your proposed fix
Test cases verifying the solution