keras-team/keras-cv

Bad YOLOv8 performance when compared to Ultralytics implementation


Current Behavior:

We are currently seeing poor performance in both classification and box prediction from our Keras implementation of the model. Training for 10 epochs and evaluating on the same dataset, we see an mAP@50 as low as 0.05, compared to 0.94 for Ultralytics. Neither model uses preset weights, and both are on the yolo_v8_s backbone. Even with 50-60 epochs of training, the Keras model's performance doesn't seem right.

Expected Behavior:

The KerasCV YOLOv8 detector should be able to achieve results comparable to the Ultralytics implementation.

Steps To Reproduce:

import keras
import keras_cv

model = keras_cv.models.YOLOV8Detector(
    num_classes=len(class_mapping),
    bounding_box_format="xyxy",
    # Backbone instantiated from the unweighted preset, per the keras.io guide.
    backbone=keras_cv.models.YOLOV8Backbone.from_preset("yolo_v8_xs_backbone"),
    fpn_depth=2,
)

optimizer = keras.optimizers.SGD(
    learning_rate=lr, momentum=0.9, global_clipnorm=10.0
)

model.compile(
    optimizer=optimizer,
    classification_loss="binary_crossentropy",
    box_loss="ciou",
)

model.fit(
    train_ds,
    epochs=10,
    callbacks=[callbacks],
    validation_data=val_ds,
)
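
For completeness, the mAP@50 figures above come from COCO-style evaluation on the validation set. A minimal sketch of that setup, following the PyCOCOCallback pattern from the keras.io guide (val_ds is the same bounding-box dataset passed to fit above):

import keras_cv

# Sketch: COCO mAP logged on the validation set after each epoch, following
# the PyCOCOCallback pattern from the keras.io object detection guide.
coco_callback = keras_cv.callbacks.PyCOCOCallback(
    validation_data=val_ds,
    bounding_box_format="xyxy",
)

model.fit(
    train_ds,
    epochs=10,
    callbacks=[coco_callback],
    validation_data=val_ds,
)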

Version:

TensorFlow 2.16.1
Keras 3.2.0 (had to revert due to issue #2421)
Keras-CV 0.8.2

Anything else:

I followed tutorials such as: https://keras.io/guides/keras_cv/object_detection_keras_cv/
Issues such as #2333 and #2353 also concern me.

Full implementation: https://github.com/jagogardiner/Doodlecode

Same here, I'm seeing this too.

Have you considered using a different backbone for your model? The pretrained models from Ultralytics, such as those available in the YOLO series, are typically trained on datasets like COCO or Open Images V7.


I have tried a few backbones:

  • yolo_v8_xs_backbone (and yolo_v8_xs_backbone_coco)
  • yolo_v8_s_backbone (and yolo_v8_s_backbone_coco)
  • yolo_v8_m_backbone (and yolo_v8_m_backbone_coco)

Going up to the larger backbones was not viable, as I could only train on my personal RTX 3070. However, I tested the Ultralytics implementation with both pretrained and untrained weights, and both gave me better results than the Keras implementation.

I might honestly be making a mistake somewhere; the performance just doesn't seem right to me. But even with the COCO-pretrained weights, I still see poor performance.

I also tried some different backbones from the documentation (https://keras.io/api/keras_cv/models/tasks/yolo_v8_detector/), but for some reason many of them, such as ResNet or EfficientNet, won't work: Keras gives me an error that the backbone is not valid.
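
For reference, what I tried looked roughly like the sketch below. I haven't verified whether non-YOLOv8 backbones are actually supported here; ResNet50V2Backbone and the "resnet50_v2_imagenet" preset are just illustrative names from the KerasCV docs.

import keras_cv

# Hedged sketch: pass an instantiated Backbone object rather than a raw
# preset string. Whether the YOLOv8 head (which pulls P3-P5 pyramid features
# from the backbone) accepts other backbone families is exactly what fails
# for me with the "backbone is not valid" error.
backbone = keras_cv.models.ResNet50V2Backbone.from_preset(
    "resnet50_v2_imagenet"
)
model = keras_cv.models.YOLOV8Detector(
    num_classes=len(class_mapping),
    bounding_box_format="xyxy",
    backbone=backbone,
    fpn_depth=2,
)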

Thanks!

I've replicated this with the KerasCV object detection example on PascalVOC using the tensorflow backend (Keras 3.3.3, Keras-CV 0.9.0, TF 2.16.1): even the pre-trained model has a high loss and an mAP of 0.005, and training for 50 epochs makes the predictions visibly worse.

I've created a colab notebook where you can test this out: https://colab.research.google.com/drive/1tsRAHGZkifmQCWqYaMUYREOCTPrBfgS0?usp=sharing

Two other issues of note: the jax backend is much slower than tf for object detection, and the model throws an error for any loss other than CIoU.
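
For anyone who wants the short version without opening the notebook, the replication is essentially the pretrained-preset flow from the keras.io guide; a rough sketch (eval_ds stands in for the guide's PascalVOC evaluation split):

import keras_cv

# Sketch of the replication: load the pretrained PascalVOC preset used in
# the keras.io guide and run inference; even this model scores ~0.005 mAP.
pretrained_model = keras_cv.models.YOLOV8Detector.from_preset(
    "yolo_v8_m_pascalvoc", bounding_box_format="xywh"
)
images, y_true = next(iter(eval_ds))
y_pred = pretrained_model.predict(images)
# y_pred is a dict with "boxes", "confidence", "classes", "num_detections"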

I would personally advise you to question the correctness of the PyCOCOCallback you are possibly using in your code.

I opened an issue on it several weeks back; it has not been resolved, nor was it closed. In particular, if the number of unique labels in your dataset is small, or there is significant imbalance in the underlying class distribution, the current implementation can encode the ground truth incorrectly, hence the inevitably erroneous evaluation metric readings...

Note that even when this issue is resolved (which is very easy to do), it still does not guarantee the correctness of the YOLOv8 implementation in keras-cv; concerns about all of that were raised here, and they are still not addressed. 🤷🏿
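
One rough way to sanity-check the callback (a sketch only, not a guaranteed-correct recipe; eval_ds is a placeholder for your evaluation split) is to compute the metrics directly with keras_cv.metrics.BoxCOCOMetrics and compare the numbers:

import keras_cv

# Hedged cross-check: compute COCO metrics directly, bypassing the callback,
# to see whether its ground-truth encoding is what skews the readings.
box_metrics = keras_cv.metrics.BoxCOCOMetrics(
    bounding_box_format="xyxy",
    evaluate_freq=int(1e9),  # defer the (slow) pycocotools run to result()
)
for images, y_true in eval_ds:
    y_pred = model.predict(images, verbose=0)
    box_metrics.update_state(y_true, y_pred)
print(box_metrics.result(force=True))  # includes MaP and MaP at IoU=50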

> I've replicated this with the KerasCV object detection example on PascalVOC [...]

Glad to know the same issues show up on PascalVOC. I do think there must be some error in the implementation of KerasCV's YOLOv8, as the Ultralytics predictions on the exact same dataset are vastly better.

> I would personally advise you to question the correctness of the PyCOCOCallback [...]

I have seen these issues too. I could try testing with a fix, but unfortunately the metrics aren't the real problem. Even just running predictions on input images, you can tell the model is not learning properly: the classification is way off, and the boxes go absolutely haywire until around a confidence of 0.51. Even then, it still picks up empty space on our drawings as objects.
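
For what it's worth, tightening the prediction decoder (the NonMaxSuppression pattern from the keras.io guide) does suppress the sub-0.51 noise, but it only masks the symptom; a sketch against my setup above:

import keras_cv

# Sketch: a stricter decoder hides the low-confidence noise described above
# but does not fix the underlying learning problem. from_logits may need
# flipping depending on how the head/losses are configured.
decoder = keras_cv.layers.NonMaxSuppression(
    bounding_box_format="xyxy",
    from_logits=False,
    confidence_threshold=0.51,  # the point where boxes stop going haywire
    iou_threshold=0.5,
)
model = keras_cv.models.YOLOV8Detector(
    num_classes=len(class_mapping),
    bounding_box_format="xyxy",
    backbone=keras_cv.models.YOLOV8Backbone.from_preset("yolo_v8_xs_backbone"),
    fpn_depth=2,
    prediction_decoder=decoder,
)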

I understand YOLO is geared more toward real-time video than static-image object detection tasks, but I would still expect it to manage this reasonably well.