JustasBart/yolov8_CPP_Inference_OpenCV_ONNX

YOLOv8 output

Marco-Nguyen opened this issue · 1 comments

Hi, I am having a hard time understanding how you handle the YOLOv8 outputs. How do you get the class ID and the confidence from the YOLOv8 output, which has a shape like (batchSize, 84, 8400)?

Hi @Marco-Nguyen, the first thing that happens is this check:

// yolov5 has an output of shape (batchSize, 25200, 85) (Num classes + box[x,y,w,h] + confidence[c])
// yolov8 has an output of shape (batchSize, 84,  8400) (Num classes + box[x,y,w,h])
if (dimensions > rows) // Check if the shape[2] is more than shape[1] (yolov8)

From this it knows whether the ONNX model is a yolov8 model or a yolov5 model.
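As a rough illustration of that check, here is a minimal sketch in plain C++. The names YoloLayout and detectLayout are hypothetical and do not exist in the repository; the sketch only assumes, as the comments above state, that a yolov8 output has its larger axis last.

```cpp
#include <cstddef>

// Hypothetical helper (not part of the repository): guesses the output
// layout from the two non-batch axes of the output tensor.
enum class YoloLayout { V5, V8 };

// shape1 and shape2 are the sizes of axes 1 and 2 of the output tensor,
// e.g. (25200, 85) for yolov5 or (84, 8400) for yolov8.
inline YoloLayout detectLayout(std::size_t shape1, std::size_t shape2)
{
    // yolov8 puts the short per-detection axis first, so axis 2 is the larger one.
    return (shape2 > shape1) ? YoloLayout::V8 : YoloLayout::V5;
}
```

Note that this is only a heuristic: it relies on the number of candidate rows being larger than the number of values per row, which holds for the two shapes quoted above.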

If it is indeed a yolov8 model, the next step is to transpose the output from (1, 84, 8400) to (1, 8400, 84) so that it matches the yolov5 output layout.

rows = outputs[0].size[2];
dimensions = outputs[0].size[1];
outputs[0] = outputs[0].reshape(1, dimensions);
cv::transpose(outputs[0], outputs[0]);

Note that we are also swapping the rows and the dimensions variables to match this transposition.
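To make the effect of the reshape and transpose more concrete, here is a small sketch of the same idea on a flat row-major buffer, without OpenCV. The function name transposeRowMajor is made up for this example; it is not what cv::transpose does internally, just an equivalent result for a 2D float matrix.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: transposes a row-major (srcRows x srcCols) buffer
// into a row-major (srcCols x srcRows) buffer.
// Assumes src.size() == srcRows * srcCols.
std::vector<float> transposeRowMajor(const std::vector<float>& src,
                                     std::size_t srcRows, std::size_t srcCols)
{
    std::vector<float> dst(src.size());
    for (std::size_t r = 0; r < srcRows; ++r)
        for (std::size_t c = 0; c < srcCols; ++c)
            dst[c * srcRows + r] = src[r * srcCols + c]; // element (r, c) moves to (c, r)
    return dst;
}
```

For yolov8 this would be called with srcRows = 84 and srcCols = 8400, so that afterwards each group of 84 consecutive floats describes one candidate detection.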

At this point it loops through all of the rows (8400 for yolov8 and 25200 for yolov5). Each row holds one possible detection as a packet of values (84 for yolov8 and 85 for yolov5), so the loop runs 8400 or 25200 times and advances the data pointer by 84 or 85 floats on each iteration. All of these candidates come out of a single forward pass of the network, which is the "You Only Look Once" part.
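The stride-based loop can be sketched like this in plain C++. The function firstValuePerRow is purely illustrative and not from the repository; it only collects the first value of every row to show how the pointer moves.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: walks a flat (rows x dimensions) buffer one row at a
// time and returns the first value of each row.
// Assumes output.size() == rows * dimensions.
std::vector<float> firstValuePerRow(const std::vector<float>& output,
                                    std::size_t rows, std::size_t dimensions)
{
    std::vector<float> firsts;
    firsts.reserve(rows);
    const float* data = output.data();
    for (std::size_t i = 0; i < rows; ++i)
    {
        firsts.push_back(data[0]); // data[0..3] would be the box x, y, w, h
        data += dimensions;        // jump to the next candidate detection
    }
    return firsts;
}
```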

For yolov5 it then simply checks index 4 of each 85-value row (Num classes + box[x,y,w,h] + confidence[c]) to see whether it's confident enough that there is a detection there at all. If there is, it checks whether it's confident enough about which class that detection belongs to:

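// yolov5 only: confidence is the value at index 4 of the current row
if (confidence >= modelConfidenseThreshold)
{
    // The per-class scores start right after x, y, w, h and the confidence
    float *classes_scores = data+5;

    // Wrap the class scores in a 1 x N Mat (no copy) so minMaxLoc can scan them
    cv::Mat scores(1, classes.size(), CV_32FC1, classes_scores);
    cv::Point class_id;
    double max_class_score;

    // Find the highest class score and its position (class_id.x is the class index)
    minMaxLoc(scores, 0, &max_class_score, 0, &class_id);

    if (max_class_score > modelScoreThreshold)
    {
        confidences.push_back(confidence);
        class_ids.push_back(class_id.x);
        // ... (snippet truncated here)
    }
}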

Note that float *classes_scores = data+5; points at the start of the class scores, so from there it looks at the confidences for each of the 80 classes.

This is where the key difference in yolov8 can be observed: yolov8 does not have this confidence dimension.
Instead it jumps straight to float *classes_scores = data+4; and checks all of the individual class confidences. If the maximum class confidence is over the threshold (for each individual possible detection), it adds this potential detection to the vectors to be processed by the NMS.

So note that yolov5 does this: confidences.push_back(confidence);
Whereas yolov8 does this: confidences.push_back(maxClassScore);

That is because yolov8 doesn't have that confidence dimension (84 vs 85).
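To put the two variants side by side, here is a minimal sketch in plain C++ that scores a single row for each layout. The struct Candidate and the functions scoreYolov8Row and scoreYolov5Row are hypothetical names for this example, and std::max_element stands in for the minMaxLoc call. It assumes numClasses is at least 1 and that the row really contains 4 + numClasses (yolov8) or 5 + numClasses (yolov5) floats.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical result type for one row of the output.
struct Candidate
{
    bool keep;        // true if the row passes the threshold(s)
    int classId;      // index of the best-scoring class
    float confidence; // value that would be pushed into confidences
};

// yolov8 row layout: x, y, w, h, then numClasses class scores.
Candidate scoreYolov8Row(const float* row, std::size_t numClasses, float scoreThreshold)
{
    const float* scores = row + 4; // no separate confidence value to skip
    const float* best = std::max_element(scores, scores + numClasses);
    return { *best > scoreThreshold, static_cast<int>(best - scores), *best };
}

// yolov5 row layout: x, y, w, h, confidence, then numClasses class scores.
Candidate scoreYolov5Row(const float* row, std::size_t numClasses,
                         float confidenceThreshold, float scoreThreshold)
{
    const float confidence = row[4]; // "is there anything here at all?"
    const float* scores = row + 5;
    const float* best = std::max_element(scores, scores + numClasses);
    const bool keep = confidence >= confidenceThreshold && *best > scoreThreshold;
    return { keep, static_cast<int>(best - scores), confidence };
}
```

The only real differences are the offset (4 vs 5) and which value ends up as the stored confidence: the best class score for yolov8, the separate confidence value for yolov5.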

And then the final step is the NMS:

std::vector<int> nms_result;
cv::dnn::NMSBoxes(boxes, confidences, modelScoreThreshold, modelNMSThreshold, nms_result);

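This suppresses overlapping boxes: among boxes whose overlap exceeds modelNMSThreshold it keeps only the highest-scoring one, and it drops boxes whose score is not above modelScoreThreshold. nms_result receives the indices of the boxes that survive, which can then be used to look up the matching entries in class_ids and confidences for the final detections.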
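For intuition, here is a sketch of a simple greedy NMS in plain C++. This is not OpenCV's implementation, only an approximation of the same idea; Box, iou, and greedyNms are names invented for this example, and it assumes boxes and scores have the same length.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical box type: top-left corner plus width and height.
struct Box { float x, y, w, h; };

// Intersection over union of two boxes (0 when they do not overlap).
float iou(const Box& a, const Box& b)
{
    const float x1 = std::max(a.x, b.x);
    const float y1 = std::max(a.y, b.y);
    const float x2 = std::min(a.x + a.w, b.x + b.w);
    const float y2 = std::min(a.y + a.h, b.y + b.h);
    const float inter = std::max(0.0f, x2 - x1) * std::max(0.0f, y2 - y1);
    const float uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Returns the indices of the boxes that survive, highest score first.
std::vector<int> greedyNms(const std::vector<Box>& boxes, const std::vector<float>& scores,
                           float scoreThreshold, float nmsThreshold)
{
    std::vector<int> order(boxes.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return scores[a] > scores[b]; });

    std::vector<int> kept;
    for (int idx : order)
    {
        if (scores[idx] <= scoreThreshold) continue; // too weak to consider
        bool suppressed = false;
        for (int k : kept) // compare against every stronger box already kept
            if (iou(boxes[idx], boxes[k]) > nmsThreshold) { suppressed = true; break; }
        if (!suppressed) kept.push_back(idx);
    }
    return kept;
}
```

Because the result is a list of indices rather than boxes, the same indices work for any parallel vector, which is why class_ids and confidences are filled in the same order as boxes.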

Feel free to follow up on this, as there's a lot to go through, but this should explain the differences between yolov5 and yolov8 and how each of them is processed.

Good luck! 🚀